Preface



Data mining from traditional relational databases as well as from non-traditional
ones, such as semi-structured data, Web data and scientific databases housing
biological, linguistic and sensor data, has recently become a popular way of
discovering hidden knowledge. In the context of relational and traditional data,
methods such as association rules, chi-square rules, ratio rules and implication
rules have been proposed in many varied contexts. In the context of
non-traditional data, newer, more experimental techniques are being proposed.
Researchers across communities agree that data mining is a key ingredient for
success in their respective areas of research and development. Consequently,
interest in developing new techniques for data mining has peaked, and great
strides are being made in answering interesting and fundamental questions in
various disciplines using data mining.

In the past, researchers mainly focused on algorithmic issues in data mining 
and placed much emphasis on scalability. Recently, the focus has shifted towards 
a more declarative way of answering questions using data mining that has given 
rise to the concept of mining queries. 

Data mining has recently been applied with success to discovering hidden
knowledge from relational databases. Methods such as association rules,
chi-square rules, ratio rules and implication rules have been proposed in several
very different contexts. To cite just the most frequent and famous ones: market
basket analysis, failures in telecommunication networks, text analysis for
information retrieval, Web content mining, Web usage, log analysis, graph mining,
information security and privacy, and finally the analysis of object traversal by
queries in distributed information systems.

These widespread and varied application domains show that data mining rules
constitute a successful and intuitive descriptive paradigm able to offer
complementary choices in rule induction. Other than inductive and abductive
logic programming, research into data mining from knowledge bases has been
almost non-existent, because contemporary methods emphasize the scalability and
efficiency of algorithmic solutions, whose inherent procedurality is difficult
to cast into the declarativity of knowledge base systems.

In particular, researchers convincingly argue that the ability to declaratively
mine and analyze relational databases for decision support is a critical
requirement for the success of the acclaimed data mining technology. Indeed, DBMSs
are today one of the most advanced and sophisticated achievements of applied
computer science in recent years. Unfortunately, almost all of the most powerful
DBMSs we have today have been developed with a focus on On-Line
Transaction-Processing tasks. By contrast, database technology for On-Line
Analytical-Processing tasks, such as data mining, is more recent and in need of
further research.



Although there have been several encouraging attempts at developing meth- 
ods for data mining using SQL, simplicity and efficiency still remain significant 
prerequisites for further development. It is well known that today database tech- 
nology is mature enough: popular DBMSs, such as Oracle, DB2 and SQL-Server, 
provide interfaces, services, packages and APIs that embed data mining algo- 
rithms for classification, clustering, association rules extraction and temporal 
sequences, such that they are directly available to programmers and ready to be 
called by applications. 

Therefore, it is envisioned that we should now be able to mine relational
databases for interesting rules directly from database query languages, without
any data restructuring or preprocessing steps. Hence no additional machinery
beyond database languages would be necessary. This vision entails that
optimization issues should be addressed at the system level, for which we now
have a significant body of research, while the analyst can concentrate on the
declarative and conceptual level, where the difficult task of interpreting the
extracted knowledge occurs. Therefore, it is now time to develop declarative
paradigms for data mining so that these developments can be exploited at the
lower, system level for query optimization.

With this aim we planned this book on “Data Mining” with an emphasis
on approaches that exploit the available database technology, declarative data
mining, intelligent querying and associated issues such as optimization, indexing,
query processing, languages and constraints. Attention is also paid to the solution
of data preprocessing problems, such as data cleaning, discretization and sampling,
developed using database tools and declarative approaches.

Most of this book is a result of the work we conducted during the development
of the cInQ project (consortium on discovering knowledge with Inductive Queries),
an EU-funded project (IST-2000-26469) aiming at developing database technology
for leveraging decision support systems by means of query languages and inductive
approaches to knowledge extraction from databases. It presents new and invited
contributions, plus the best papers, extensively revised and enlarged, presented
during workshops on the topics of database technology, data mining and inductive
databases at international conferences such as EDBT and PKDD/ECML in 2002.
Volume Organization



This volume is organized in two main sections. The former focuses on Database 
Languages and Query Execution, while the latter focuses on methodologies, tech- 
niques and new approaches that provide Support for Knowledge Discovery Pro- 
cess. Here, we briefly overview each contribution. 



Database Languages and Query Execution 

The first contribution is Inductive Databases and Multiple Uses of Frequent
Itemsets: The cInQ Approach, which presents the main theoretical and applied
contributions to the field of inductive databases obtained in the cInQ
project.

In Query Languages Supporting Descriptive Rule Mining: A Comparative Study 
we provide a comparison of features of available relational query languages for 
data mining, such as DMQL, MSQL, MINE RULE, and standardization efforts for 
coupling database technology and data mining systems, such as OLEDB-DM and 
PMML. 

Declarative Data Mining Using SQL-3 shows a new approach, compared to
existing SQL approaches, to mine association rules from an object-relational
database: it uses a recursive join in SQL-3 that requires no restructuring or
preprocessing of the data. It proposes a new mine by SQL-3 operator for capturing
the functionality of the proposed approach.

Towards a Logic Query Language for Data Mining presents a logic database
language with elementary data mining mechanisms, such as user-defined aggregates,
that provide a powerful and general model of the relevant aspects and
tasks of knowledge discovery.

Data Mining Query Language for Knowledge Discovery in a Geographical
Information System presents SDMOQL, a spatial data mining query language for
knowledge extraction from GIS. The language supports the extraction of
classification rules and association rules, the use of background models, various
interestingness measures and visualization.

Towards Query Evaluation in Inductive Databases Using Version Spaces studies
inductive queries. Such queries specify constraints that should be satisfied by the
data mining patterns in which the user is interested. This work investigates
the properties of solution spaces of queries with monotonic and anti-monotonic
constraints and their boolean combinations.

The GUHA Method, Data Preprocessing and Mining surveys the basic principles
and foundations of the GUHA method, the available systems and related works.
This method originated in the Czechoslovak Academy of Sciences in Prague in
the mid 1960s with strong logical and statistical foundations. Its main principle
is to let the computer generate and evaluate all the hypotheses that may be
interesting given the available data and the domain problem. This work also
discusses the relationships between the GUHA method and relational data mining and
discovery science.

Constraint Based Mining of First Order Sequences in SeqLog presents a logical
language, SeqLog, for mining and querying sequential data and databases. This
language is used as a representation language for an inductive database system.
In this system, variants of level-wise algorithms for computing the version space
of the solutions are proposed and tested experimentally in the user-modeling domain.



Support for Knowledge Discovery Process 

Interactivity, Scalability and Resource Control for Efficient KDD Support in 
DBMS proposes a new approach for combining preprocessing and data min- 
ing operators in a KDD-aware implementation algebra. In this way data mining 
operators can be integrated smoothly into a database system, thus allowing 
interactivity, scalability and resource control. This framework is based on the 
extensive use of pipelining and is built upon an extended version of a specialized 
database index. 

Frequent Itemset Discovery with SQL Using Universal Quantification investi- 
gates the integration of data analysis functionalities into two basic components 
of a database management system: query execution and optimization. It employs 
universal and existential quantifications in queries and a vertical layout to ease 
the set containment operations needed for frequent itemsets discovery. 

Deducing Bounds on the Support of Itemsets provides a complete set of rules
for deducing tight bounds on the support of an itemset if the supports of all its
subsets are known. These bounds can be used by the data mining system to
choose the best access path to data and provide a better representation of the
collection of frequent itemsets.

Model-Independent Bounding of the Supports of Boolean Formulae in Binary
Data considers frequencies of arbitrary boolean formulas, a new class of
aggregates: the summaries. These are computed for descriptive purposes on a
sparse binary data set. This work considers the problem of finding tight upper
bounds on these frequencies and gives a general formulation of the problem with
a linear programming solution.

Condensed Representations for Sets of Mining Queries proposes a general
framework for condensed representations of sets of mining queries, defined by
monotonic and anti-monotonic selection predicates. This work proves important for
inductive and database systems for data mining since it deals with sets of queries,
whereas previous work on maximal, closed and condensed representations has so far
treated the representation of a single query only.



One-Sided Instance-Based Boundary Sets introduces a family of version-space 
representations that are important for their applicability to inductive databases. 
They correspond to the task of concept learning from a database of exam- 
ples when this database is updated. One-sided instance-based boundary sets 
are shown to be correctly and efficiently computable. 

Domain Structures in Filtering Irrelevant Frequent Patterns introduces a no- 
tion of domain constraints, based on distance measures and in terms of domain 
structure and concept taxonomies. Domain structures are useful in the analysis 
of communications networks and complex systems. Indeed they allow irrelevant 
combinations of events that reflect the simultaneous information of independent 
processes in the same database to be pruned. 

Integrity Constraints over Association Rules investigates the notion of integrity 
constraints in inductive databases. This concept is useful in detecting inconsis- 
tencies in the results of common data mining tasks. This work proposes a form of 
integrity constraints called association map constraints that specifies the allowed 
variations in confidence and support of association rules. 
Inductive Databases and Multiple Uses of
Frequent Itemsets: The cInQ Approach 



Jean-François Boulicaut

Institut National des Sciences Appliquées de Lyon,
LIRIS CNRS FRE 2672, Bâtiment Blaise Pascal
F-69621 Villeurbanne cedex, France



Abstract. Inductive databases (IDBs) have been proposed to address the
problem of knowledge discovery from huge databases. With an IDB, the
user/analyst performs a set of very different operations on data using
a query language powerful enough to support all the required processing,
such as data preprocessing, pattern discovery and pattern post-processing.
We present a synthetic view of important concepts that have
been studied within the cInQ European project when considering the
pattern domain of itemsets. Mining itemsets has proved useful not
only for association rule mining but also for feature construction,
classification, clustering, etc. We introduce the concepts of pattern domain,
evaluation functions, primitive constraints, inductive queries and solvers
for itemsets. We focus on simple high-level definitions that let us set
aside technical details, which the interested reader will find, among
others, in cInQ publications.



1 Introduction 

Knowledge Discovery in Databases (KDD) is a complex interactive process which
involves many steps that must be done sequentially. In the cInQ project¹, we
want to develop a new generation of databases, called “inductive databases”
(IDBs), suggested by Imielinski and Mannila in [42] and for which a simple
formalization has been proposed in [20]. This kind of database integrates raw data
with knowledge extracted from raw data, materialized in the form of patterns,
into a common framework that supports the knowledge discovery process within
a database framework. In this way, the process of KDD consists essentially in a
querying process, enabled by a query language that can deal either with raw data
or patterns and that can be used throughout the whole KDD process across many
different applications. A few query languages can be considered as candidates
for inductive databases. For instance, considering the prototypical case of
association rule mining, [10] is a comparative evaluation of three proposals
(MSQL [43], DMQL [38], and MINE RULE [59]) in the light of the IDBs’ requirements.

¹ This research is partially funded by the Future and Emerging Technologies arm of
the IST Programme FET-Open scheme (cInQ project IST-2000-26469). The author
warmly acknowledges all the contributors to the cInQ project and more particularly
Luc De Raedt, Baptiste Jeudy, Mika Klemettinen, and Rosa Meo.

R. Meo et al. (Eds.): Database Support for Data Mining Applications, LNAI 2682, pp. 1-23, 2004.

In this paper, we focus on mining queries, the so-called inductive queries, i.e., 
queries that return patterns from a given database. More precisely, we consider 
the pattern domain of itemsets and databases that are transactional databases. 
Doing so, we can provide examples of concepts that have emerged as important 
within the cInQ project after 18 months of work. 

It is useful to abstract the meaning of mining queries. A simple model has
been introduced in [55] that considers a data mining process as a sequence
of queries over the data but also the so-called theory of the data. Given a
language $\mathcal{L}$ of patterns (e.g., itemsets, sequences, association rules),
the theory of a database $r$ with respect to $\mathcal{L}$ and a selection
predicate $q$ is the set
$Th(r, \mathcal{L}, q) = \{\phi \in \mathcal{L} \mid q(r, \phi) \text{ is true}\}$.
The predicate $q$ indicates whether a pattern $\phi$ is considered interesting
(e.g., $\phi$ denotes a property that is “frequent” in $r$). The selection
predicate can be defined as a combination (boolean expression)
of primitive constraints that have to be satisfied by the patterns. Some of them
refer to the “behavior” of a pattern in the data, e.g., its “frequency” in a given
data set is above or below a user-given threshold, some others define syntactical
restrictions on desired patterns, e.g., its “length” is below a user-given
threshold. Preprocessing concerns the definition of the database $r$, the mining
phase is often the computation of the specified theory while post-processing can be
considered as a querying activity on a materialized theory or the computation
of a new theory.
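
As an illustration of this model, the following minimal Python sketch computes a theory by naive enumeration. It is only a sketch under stated assumptions: the toy database `R`, the helper names (`language`, `theory`, `q_frequent_and_short`) and the thresholds are hypothetical and not taken from [55].

```python
from itertools import combinations

# Toy transactional database r: six transactions over the items A, B, C, D.
# (Hypothetical data chosen for illustration.)
R = [set("ABCD"), set("BC"), set("AC"), set("AC"), set("ABCD"), set("ABC")]
ITEMS = sorted(set().union(*R))


def language(items):
    """Enumerate the pattern language L: here, every subset of `items`."""
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def theory(r, patterns, q):
    """Th(r, L, q): the patterns phi of L for which q(r, phi) is true."""
    return {phi for phi in patterns if q(r, phi)}


def q_frequent_and_short(r, phi, min_freq=0.6, max_len=2):
    """Selection predicate: one data-dependent and one syntactic constraint."""
    frequency = sum(phi <= t for t in r) / len(r)
    return frequency >= min_freq and len(phi) <= max_len


th = theory(R, language(ITEMS), q_frequent_and_short)
print(sorted("".join(sorted(phi)) for phi in th))
```

Enumerating the whole language is only feasible for such a tiny item set; the predicate is kept as an ordinary function so that other boolean combinations of primitive constraints can be plugged in.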

This formalization, however, does not reflect the context of many classical
data mining processes. Quite often, the user is interested not only in a collection
of patterns that satisfy some constraints (e.g., frequent patterns, strong rules,
approximate inclusion or functional dependencies) but also in some properties
of these patterns in the selected database (e.g., their frequencies, the error for
approximate dependencies). In that case, we will consider the so-called extended
theories. For instance, when mining frequent itemsets or frequent and valid
association rules [2], the user needs the frequency of the specified patterns or
rules. Indeed, during the needed post-processing phase, the user/analyst often
uses various objective interestingness measures like the confidence [2], the
conviction [23] or the J-measure [72] that are computed efficiently provided that the
frequency of each frequent itemset is available. Otherwise, it might be extremely
expensive to look at the data again.

Designing solvers for more or less primitive constraints concerns the core of 
data mining algorithmic research. We must have solvers that can compute the 
(extended) theories and that have good properties in practice (e.g., scalability 
w.r.t. the size of the database or the size of the search space). A “generate 
and test” approach that would enumerate the sentences of $\mathcal{L}$ and then test
the selection predicate q is generally impossible. A huge effort has concerned 
a clever use of the constraints occurring in q to have a tractable evaluation 
of useful inductive queries. This is the research area of constraint-based data 
mining. Most of the algorithmic research in pattern discovery tackles the design 
of complete algorithms for computing (extended) theories given more or less
specific conjunctions of primitive constraints. Typically, many researchers have
considered the computation of frequent patterns, i.e., patterns that satisfy a
minimal frequency constraint. An important paper on a generic algorithm for
such a typical mining task is [55]. However, if the active use of the so-called
anti-monotonic constraints (e.g., the minimal frequency) is now well-understood, the
situation is far less clear for non anti-monotonic constraints [64,51,18].
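
To make the active use of an anti-monotonic constraint concrete, here is a minimal level-wise sketch in Python. It is written in the spirit of such generic algorithms but is not the algorithm of [55]; the database `R` and the function names are assumptions made for illustration, and the empty itemset is deliberately left out of the enumeration.

```python
from itertools import combinations

# Hypothetical toy database: six transactions over the items A, B, C, D.
R = [set("ABCD"), set("BC"), set("AC"), set("AC"), set("ABCD"), set("ABC")]


def frequency(itemset, r):
    """Relative frequency: share of transactions that contain the itemset."""
    return sum(itemset <= t for t in r) / len(r)


def levelwise_frequent(r, gamma):
    """Return {itemset: frequency} for all non-empty gamma-frequent itemsets."""
    items = sorted(set().union(*r))
    level = [frozenset([i]) for i in items]
    level = [s for s in level if frequency(s, r) >= gamma]
    result = {s: frequency(s, r) for s in level}
    while level:
        previous = set(level)
        # Join step: merge frequent k-sets that differ by exactly one item.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Pruning step (anti-monotonicity): every k-subset must be frequent,
        # otherwise the candidate cannot be frequent and is never counted.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in previous
                             for sub in combinations(c, len(c) - 1))}
        level = [c for c in candidates if frequency(c, r) >= gamma]
        result.update((c, frequency(c, r)) for c in level)
    return result


frequent = levelwise_frequent(R, 0.6)
```

The result keeps the frequency of each frequent itemset, i.e., it is an extended theory, so later post-processing does not need to scan the data again.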

A second major issue is the possibility to approximate the results of (ex- 
tended) inductive queries. This approximation can concern a collection of pat- 
terns that is a superset or a subset of the desired collection. This is the typical 
case when the theories are computed from a sample of the data (see, e.g., [74]) 
or when a relaxed constraint is used. Another important case of approximation
for extended theories is the exact computation of the underlying theory while
the evaluation functions are only approximated. This has led to an important
research area, the computation of the so-called condensed representations [54], a
domain in which we have been playing a major role since the study of frequent
closed itemsets as an ε-adequate representation for frequency queries [12].

This paper is organized as follows. Section 2 introduces notations and defi- 
nitions that are needed for discussing inductive queries that return itemsets. It 
contains an instance of the definition of a pattern domain. Section 3 identifies 
several important open problems. Section 4 provides elements of solution that 
are currently studied within the cInQ project. Section 5 is a short conclusion. 



2 A Pattern Domain for Itemsets 

The definition of a pattern domain is made of the definition of a language of
patterns $\mathcal{L}$, evaluation functions that assign a semantics to each pattern in a
given database r, languages for primitive constraints that specify the desired 
patterns, and inductive query languages that provide a language for combining 
the primitive constraints. 

We do not claim that this paper is an exhaustive description of the itemset 
pattern domain. Even though we selected representative examples of evaluation 
functions and primitive constraints, many others have been or might be defined 
and used. 

2.1 Language of Patterns and Terminology 

We introduce some notations that are used for defining the pattern domain of 
itemsets. In that context, we consider that: 

— A so-called transactional database contains the data, 

— Patterns are the so-called itemsets and one kind of descriptive rule that can 
be derived from them, i.e., the association rules. 

Definition 1 (Transactional Databases). Assume that Items is a finite set
of symbols denoted by capital letters, e.g., Items $= \{A, B, C, \ldots\}$. A transaction
$t$ is a subset of Items. A transactional database $r$ is a finite and non-empty
multiset $r = \{t_1, t_2, \ldots, t_n\}$ of transactions.

Typical examples of transactional databases concern basket data (transac- 
tions are sets of products that are bought by customers), textual data (transac- 
tions are sets of keywords or descriptors that characterize documents), or gene 
expression data (transactions are sets of genes that are over-expressed in given 
biological conditions).

Definition 2 (Itemsets). An itemset is a subset of Items. The language of
patterns for itemsets is $\mathcal{L} = 2^{\text{Items}}$.

We often use a string notation for sets, e.g., AB for {A,B}. Figure 1 provides 
an example of a transactional database and some information about itemsets 
within this database. 

Association rules are not only a classical kind of pattern derived from itemsets 
[1,2] but are also used for some important definitions. 

Definition 3 (Association Rules). An association rule is denoted $X \Rightarrow Y$
where $X \cap Y = \emptyset$ and $X \subseteq$ Items is the body of the rule and
$Y \subseteq$ Items is the head of the rule.

Let us now define constraints on itemsets. 

Definition 4 (Constraint). If $\mathcal{T}$ denotes the set of all transactional
databases and $2^{\text{Items}}$ the set of all itemsets, an itemset constraint $C$
is a predicate over $2^{\text{Items}} \times \mathcal{T}$. An itemset
$S \in 2^{\text{Items}}$ satisfies a constraint $C$ in the database $r \in \mathcal{T}$
iff $C(S, r) = \text{true}$. When it is clear from the context, we write $C(S)$. Given a
subset $I$ of Items, we define $SAT_C(I) = \{S \in I \mid S \text{ satisfies } C\}$.
$SAT_C$ denotes $SAT_C(2^{\text{Items}})$. The same definitions can be easily extended to rules.

2.2 Evaluation Functions 

Evaluation functions return information about the properties of a given pattern
in a given database. Notice that using these evaluation functions can be
considered as a useful task for the user/analyst. It corresponds to hypothesis testing
when hypotheses can be expressed as itemsets or association rules, e.g., what
are the transactions that support the hypothesis H? How many transactions
support H? Do I have fewer than n counter-examples for hypothesis H?

Several evaluation functions are related to the “satisfiability” of a pattern in
a given data set, i.e., deciding whether a pattern holds or not in a given database.

Definition 5 (Support for Itemsets and Association Rules). A transaction
$t$ supports an itemset $X$ if every item in $X$ belongs to $t$, i.e., $X \subseteq t$.
It is then possible to define a boolean evaluation function $e_1$ such that
$e_1(X, r)$ is true if all the transactions in $r$ support $X$ and false otherwise.
The same definition can be adapted for association rules: $e_1(X \Rightarrow Y, r)$
returns true iff when $r$ supports $X$, it supports $Y$ as well. The support
(denoted $\mathrm{support}(S, r)$) of an itemset $S$ is the multiset of all
transactions of $r$ that support $S$ (e.g., $\mathrm{support}(\emptyset) = r$). The
support of a rule is defined as the support of the itemset $X \cup Y$. A transaction
$t$ supports a rule $X \Rightarrow Y$ if it supports $X \cup Y$.

Definition 6 (Exceptions to Rules). A transaction $t$ is an exception for a
rule $X \Rightarrow Y$ if it supports $X$ and it does not support $Y$. It is then
possible to define a new boolean evaluation function $e_2(X \Rightarrow Y, r)$ that
returns true if none of the transactions $t$ in $r$ is an exception to the rule
$X \Rightarrow Y$. A rule with no exception is called a logical rule.

These evaluation functions that return sets of transactions or boolean values 
are useful when crossing over the patterns and the transactional data. Also, the 
size of the supporting set is often used. 

Definition 7 (Frequency). The absolute frequency of an itemset $S$ in $r$ is
defined by $\mathcal{F}_a(S, r) = |\mathrm{support}(S, r)|$ where $|\cdot|$ denotes
the cardinality of the multiset (each transaction is counted with its multiplicity).
The relative frequency of $S$ in $r$ is
$\mathcal{F}(S, r) = |\mathrm{support}(S, r)| / |\mathrm{support}(\emptyset, r)|$.
When there is no ambiguity from the context, parameter $r$ is omitted and the
frequency denotes the relative frequency (i.e., a number in [0,1]).

Figure 1 provides an example of a transactional database and the supports 
and the frequencies of some itemsets. 





Transaction  Items
t1           ABCD
t2           BC
t3           AC
t4           AC
t5           ABCD
t6           ABC

Itemset  Support               Frequency
A        {t1, t3, t4, t5, t6}  0.83
B        {t1, t2, t5, t6}      0.67
AB       {t1, t5, t6}          0.5
AC       {t1, t3, t4, t5, t6}  0.83
CD       {t1, t5}              0.33
ACD      {t1, t5}              0.33

Fig. 1. Supports and frequencies of some itemsets in a transactional database
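
The support and frequency evaluation functions can be sketched in a few lines of Python. This is a minimal illustration, assuming the six transactions of Figure 1 are stored in a dictionary keyed by transaction identifier (so that identical transactions such as t3 and t4 remain distinct members of the multiset); the function names are hypothetical.

```python
# Transactional database of Figure 1, keyed by transaction identifier.
R = {
    "t1": set("ABCD"),
    "t2": set("BC"),
    "t3": set("AC"),
    "t4": set("AC"),
    "t5": set("ABCD"),
    "t6": set("ABC"),
}


def support(itemset, r):
    """Identifiers of the transactions of r that contain every item of itemset."""
    return [tid for tid, t in r.items() if set(itemset) <= t]


def absolute_frequency(itemset, r):
    """Number of supporting transactions, counted with multiplicity."""
    return len(support(itemset, r))


def relative_frequency(itemset, r):
    """Absolute frequency divided by the size of support of the empty itemset."""
    return absolute_frequency(itemset, r) / absolute_frequency("", r)


for itemset in ["A", "B", "AB", "AC", "CD", "ACD"]:
    print(itemset, support(itemset, R), round(relative_frequency(itemset, R), 2))
```

Strings such as `"AB"` stand for the itemset {A, B}, following the string notation for sets used in the text.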



Other measures might be introduced for itemsets that, e.g., return the
degree of correlation between the attributes they contain. We must then provide
evaluation functions that compute these measures.

It is straightforward to define the frequency evaluation function of an
association rule $X \Rightarrow Y$ in $r$ as
$\mathcal{F}(X \Rightarrow Y, r) = \mathcal{F}(X \cup Y, r)$ [1,2]. When mining
association rules, we often use objective interestingness measures like confidence
[2], conviction [23], J-measure [72], etc. These can be considered as new
evaluation functions. Most of these measures can be computed from the
frequencies of rule components. For instance, the confidence of a rule
$X \Rightarrow Y$ in $r$ is
$\mathrm{conf}(X \Rightarrow Y) = \mathcal{F}(X \Rightarrow Y, r) / \mathcal{F}(X, r)$.
It gives the conditional probability that a transaction from $r$ supports
$X \cup Y$ when it supports $X$. The confidence of a
logical rule is thus equal to 1.
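
The link between confidence, exceptions and logical rules can be checked on the database of Figure 1 with a short Python sketch. The function names are hypothetical, and the sketch assumes that the body of the rule is supported by at least one transaction (otherwise the confidence is undefined).

```python
# Transactions t1..t6 of Figure 1.
R = [set("ABCD"), set("BC"), set("AC"), set("AC"), set("ABCD"), set("ABC")]


def frequency(itemset, r):
    """Relative frequency of an itemset in r."""
    return sum(set(itemset) <= t for t in r) / len(r)


def confidence(body, head, r):
    """conf(X => Y) = F(X u Y, r) / F(X, r); assumes F(X, r) > 0."""
    return frequency(set(body) | set(head), r) / frequency(body, r)


def exceptions(body, head, r):
    """Transactions that support the body but not the head."""
    return [t for t in r if set(body) <= t and not set(head) <= t]


def is_logical(body, head, r):
    """A logical rule is a rule without any exception."""
    return not exceptions(body, head, r)


print(confidence("A", "C", R), is_logical("A", "C", R))
print(confidence("A", "B", R), len(exceptions("A", "B", R)))
```

Here the rule A ⇒ C has no exception, while A ⇒ B has the two transactions equal to AC as exceptions.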

We now consider several evaluation functions that have been less studied in
the data mining context but have proved quite useful in the last 3 years
(see, e.g., [65,12,75,15,16]). Notice however that these concepts have been used
for quite a long time in other contexts, e.g., in concept lattices.

Definition 8 (Closures of Itemsets). The closure of an itemset $S$ in $r$
(denoted by $\mathrm{closure}(S, r)$) is the maximal (for set inclusion) superset of
$S$ which has the same support as $S$. In other terms, the closure of $S$ is the set
of items that are common to all the transactions which support $S$.

Notice that when the closure of an itemset $X$ is a proper superset of $X$, say
$Y$, it means that an association rule $X \Rightarrow Y \setminus X$ holds in $r$ with confidence 1.

Example 1. In the database of Figure 1, let us compute closure(AB). Items A 
and B occur in transactions 1, 5 and 6. Item C is the only other item that is also 
present in these transactions, thus closure(AB) = ABC. Also, closure(A) = AC, 
closure(B) = BC, and closure(BC) = BC. 
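
The closure computation of Example 1 can be reproduced with a minimal Python sketch. The function name and the handling of an unsupported itemset are assumptions made for illustration; in the database of Figure 1 every itemset over A, B, C, D is supported, so that branch is never taken here.

```python
# Transactions t1..t6 of Figure 1.
R = [set("ABCD"), set("BC"), set("AC"), set("AC"), set("ABCD"), set("ABC")]


def closure(itemset, r):
    """Items common to all the transactions of r that support the itemset."""
    supporting = [t for t in r if set(itemset) <= t]
    if not supporting:
        # Assumption: with an empty support, the maximal superset having the
        # same support is taken to be the set of all items occurring in r.
        return frozenset().union(*r)
    return frozenset(set.intersection(*supporting))


for s in ["AB", "A", "B", "BC"]:
    print(s, "".join(sorted(closure(s, R))))
```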

We now introduce an extension of this evaluation function [15,16]. 

Definition 9 ($\delta$-closure). Let $\delta$ be an integer and $S$ an itemset. The
$\delta$-closure of $S$, $\mathrm{closure}_\delta(S)$, is the maximal (w.r.t. set
inclusion) superset $Y$ of $S$ such that for every item $A \in Y - S$,
$|\mathrm{support}(S \cup \{A\})|$ is at least $|\mathrm{support}(S)| - \delta$.
In other terms, $\mathcal{F}_a(\mathrm{closure}_\delta(S))$ has almost the same value
as $\mathcal{F}_a(S)$ when $\delta$ is small w.r.t. the number of transactions.

Example 2. In the database of Figure 1, $\mathrm{closure}_2(B) = ABCD$ while
$\mathrm{closure}_0(B) = BC$.



Notice that $\mathrm{closure}_0 = \mathrm{closure}$. Also, the $\delta$-closure of a
set $X$ provides an association rule with high confidence between $X$ and
$\mathrm{closure}_\delta(X) \setminus X$ when $\delta$
is a positive integer that is small w.r.t. the number of transactions.
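
Since the condition of Definition 9 is stated item by item, the maximal superset is obtained by adding every item that passes the test. The following Python sketch applies this reading to the database of Figure 1; the function name `delta_closure` is hypothetical.

```python
# Transactions t1..t6 of Figure 1.
R = [set("ABCD"), set("BC"), set("AC"), set("AC"), set("ABCD"), set("ABC")]


def absolute_frequency(itemset, r):
    """Number of transactions of r that contain the itemset."""
    return sum(itemset <= t for t in r)


def delta_closure(itemset, r, delta):
    """S plus every item A such that |support(S u {A})| >= |support(S)| - delta."""
    s = frozenset(itemset)
    items = set().union(*r)
    base = absolute_frequency(s, r)
    extra = {a for a in items - s
             if absolute_frequency(s | {a}, r) >= base - delta}
    return s | extra


print("".join(sorted(delta_closure("B", R, 0))))
print("".join(sorted(delta_closure("A", R, 2))))
```

With delta set to 0 the function coincides with the ordinary closure, which gives a simple consistency check.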

It is of course possible to define many other evaluation functions. We gave 
representative examples of such functions and we now consider examples of prim- 
itive constraints that can be built from them. 

2.3 Primitive Constraints 

Many primitive constraints can be defined. We consider some examples that have
proved useful. These examples are representative of two important kinds of
constraints: constraints based on evaluation functions and syntactic constraints.
The latter can be checked without any access to the data and are related to the
well known machine learning concept of linguistic bias (see, e.g., [63]).

Let us consider primitive constraints based on frequency. First, we can enforce
that a given pattern is frequent enough ($C_{minfreq}(S)$) and then we specify that a
given pattern has to be infrequent or not too frequent ($C_{maxfreq}(S)$).

Definition 10 (Minimal Frequency). Given an itemset $S$ and a frequency
threshold $\gamma \in [0, 1]$, $C_{minfreq}(S) \equiv \mathcal{F}(S) \geq \gamma$.
Itemsets that satisfy $C_{minfreq}$ are said to be $\gamma$-frequent or frequent in
$r$. Indeed, this constraint can be defined also on association rules:
$C_{minfreq}(X \Rightarrow Y) \equiv \mathcal{F}(X \Rightarrow Y) \geq \gamma$.
Definition 11 (Maximal Frequency). Given an itemset $S$ and a frequency
threshold $\gamma \in [0,1]$, $C_{maxfreq}(S) \equiv \mathcal{F}(S) \leq \gamma$.
Indeed, this constraint can be defined also on association rules.

Definition 12 (Minimal Confidence on Rules). Given a rule $X \Rightarrow Y$ and
a confidence threshold $\theta \in [0,1]$,
$C_{minconf}(X \Rightarrow Y) \equiv \mathrm{conf}(X \Rightarrow Y) \geq \theta$.
Rules that satisfy $C_{minconf}$ are called valid rules. A dual constraint for
maximal confidence might be introduced as well.
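
As a rough illustration of how these constraints combine, the Python sketch below derives valid rules from an extended theory, i.e., from frequent itemsets stored with their frequencies, without reading the data again. It rests on assumptions that are not in the text: the dictionary `FREQUENT` is the set of non-empty 0.6-frequent itemsets of Figure 1, only rules with a non-empty body and a non-empty head are enumerated, and the name `valid_rules` is hypothetical.

```python
from itertools import combinations

# Non-empty 0.6-frequent itemsets of Figure 1 with their relative frequencies.
FREQUENT = {
    frozenset("A"): 5 / 6,
    frozenset("B"): 4 / 6,
    frozenset("C"): 6 / 6,
    frozenset("AC"): 5 / 6,
    frozenset("BC"): 4 / 6,
}


def valid_rules(frequent, theta):
    """Rules (body, head, confidence) with confidence >= theta.

    Every body is a subset of a frequent itemset, hence frequent itself
    (anti-monotonicity), so its frequency is available in `frequent`.
    """
    rules = []
    for itemset, freq in frequent.items():
        for size in range(1, len(itemset)):
            for body in map(frozenset, combinations(sorted(itemset), size)):
                conf = freq / frequent[body]
                if conf >= theta:
                    head = itemset - body
                    rules.append(("".join(sorted(body)),
                                  "".join(sorted(head)), conf))
    return rules


for body, head, conf in valid_rules(FREQUENT, 0.7):
    print(f"{body} => {head} (confidence {conf:.2f})")
```

Allowing an empty body would add further rules; that choice is left open here.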

Example 3. Considering the database of Figure 1, if $C_{minfreq}$ specifies that an
itemset (or a rule) must be 0.6-frequent, then $SAT_{C_{minfreq}} = \{A, B, C, AC, BC\}$. For

rules, if the confidence threshold is 0.7, then the frequent and valid rules are

It is straightforward to generalize these definitions to all the other
interestingness measures that we can use for itemsets and rules. However, let us notice
that not all the interestingness measures have bounded domain values. This
motivates the introduction of optimality constraints.

Definition 13 (Optimality). Given an evaluation function $\mathcal{E}$ that returns an
ordinal value, let us denote by $C_{opt}(\mathcal{E}, \phi, n)$ the constraint that is
satisfied if $\phi$ belongs to the $n$ best patterns according to $\mathcal{E}$ values
(the $n$ patterns with the highest values).

For instance, such a constraint can be used to specify that only the n most 
frequent patterns are desired (see [7,70,71] for other examples). 

Another kind of primitive constraint concerns the syntactical restrictions that 
can be defined on one pattern. By syntactical, we mean constraints that can 
be checked without any access to the data and/or the background knowledge, 
just by looking at the pattern. A systematic study of syntactical constraints for 
itemsets and rules has been described in [64,51]. 

Definition 14 (Syntactic Constraints). It is of the form $S \in \mathcal{L}_C$, where
$\mathcal{L}_C \subseteq \mathcal{L} = 2^{\text{Items}}$. Various means can be used to
specify $\mathcal{L}_C$, e.g., regular expressions.

Some other interesting constraints can use additional information about the 
items, i.e., some background knowledge encoded in, e.g., relational tables. In 
[64], the concept of aggregate constraint is introduced. 

Definition 15 (Aggregate Constraint). It is of the form $agg(S)\ \theta\ v$, where
agg is one of the aggregate functions min, max, sum, count, avg, and $\theta$ is one
of the boolean operators $=, \ne, <, \le, >, \ge$. It says that the aggregate of the set of
numeric values in S stands in relationship $\theta$ to v.

Example 4. Consider the database of Figure 1 and assume that $C_{size}(S) \equiv |S| \le 2$
(it is equivalent to $count(S) \le 2$) and $C_{miss}(S) \equiv B \notin S$. Then $SAT_{C_{size}}$ =
{∅, A, B, C, D, AB, AC, AD, BC, BD, CD} and $SAT_{C_{miss}}$ = {∅, A, C, D, AC, AD, CD, ACD}.
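Syntactic and aggregate constraints can be checked by looking at the pattern and the background knowledge only. The following sketch illustrates this under stated assumptions: the `PRICE` table, the helper `agg_constraint` and the constraint `c_cheap` are hypothetical and serve only as an example of background knowledge attached to the items A, B, C, D.

```python
from itertools import combinations
import operator

ITEMS = "ABCD"
PRICE = {"A": 10, "B": 60, "C": 25, "D": 40}   # hypothetical background knowledge

def powerset(items):
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

AGG = {"min": min, "max": max, "sum": sum, "count": len,
       "avg": lambda xs: sum(xs) / len(xs)}

def agg_constraint(agg, theta, v, attr=PRICE):
    """Build the constraint agg(S) theta v over the values of `attr`."""
    def check(s):
        values = [attr[i] for i in s]
        # min, max and avg are undefined on the empty itemset: treat as unsatisfied.
        if not values and agg in ("min", "max", "avg"):
            return False
        return theta(AGG[agg](values), v)
    return check

c_size = agg_constraint("count", operator.le, 2)    # |S| <= 2
c_miss = lambda s: "B" not in s                     # purely syntactic
c_cheap = agg_constraint("sum", operator.lt, 50)    # sum(S.price) < 50

sat_size = [s for s in powerset(ITEMS) if c_size(s)]
sat_miss = [s for s in powerset(ITEMS) if c_miss(s)]
print(len(sat_size), len(sat_miss), c_cheap(frozenset("AC")))
```

None of these checks reads the transactions, which is what distinguishes them from frequency-based constraints.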
The same kind of syntactical constraint can be expressed on association rules,
including the possibility to express constraints on the body and/or the head of 
the rule. 

We now consider primitive constraints based on closures. 

Definition 16 (Closed Itemsets and Constraint $C_{close}$). A closed itemset
is an itemset that is equal to its closure in r. Let us assume that $C_{close}(S) \equiv
closure(S) = S$. In other terms, closed itemsets are maximal sets of items that
are supported by a multiset of transactions.

Example 5. In the database of Figure 1, the closed itemsets are C, AC, BC, ABC, 
and ABCD. 

Free itemsets are sets of items that are not “strongly” correlated [15]. They 
have been designed as a useful intermediate representation for computing closed 
sets since the closed sets are the closures of the free sets. 

Definition 17 (Free Itemsets and Constraint $C_{free}$). An itemset S is free if
no logical rule holds between its items, i.e., there do not exist two distinct itemsets
X, Y such that $S = X \cup Y$, $Y \ne \emptyset$ and $X \Rightarrow Y$ is a logical rule.

Example 6. In the database of Figure 1, the free sets are ∅, A, B, D, and AB.
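Closures, closed itemsets and free itemsets can be computed naively on a small database, as in the sketch below. The database `DB` is an assumption: it was chosen for illustration and is not claimed to be the one of Figure 1. The test for freeness uses the alternative characterization by supports and, since support can only grow when an item is removed, only looks at the subsets obtained by removing a single item.

```python
from itertools import combinations

# Assumed toy database (illustration only, Figure 1 may differ).
DB = [frozenset(t) for t in ("ABCD", "ABCD", "ABC", "AC", "BC", "C")]
ITEMS = frozenset().union(*DB)

def support(s, db=DB):
    return sum(s <= t for t in db)

def closure(s, db=DB):
    """Intersection of all transactions containing s (all items if none does)."""
    covering = [t for t in db if s <= t]
    return frozenset.intersection(*covering) if covering else ITEMS

def is_closed(s):
    return closure(s) == s

def is_free(s):
    """s is free iff removing any single item strictly increases the support."""
    return all(support(s - {i}) > support(s) for i in s)

def powerset(items):
    items = sorted(items)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

closed = ["".join(sorted(s)) for s in powerset(ITEMS) if is_closed(s)]
free = ["".join(sorted(s)) for s in powerset(ITEMS) if is_free(s)]
print(closed, free)
```

On this assumed database, every closed set is the closure of at least one free set, e.g., `closure(frozenset("AB"))` is ABC.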

An alternative definition is that all the proper subsets of a free set S have
a different frequency than S. Notice that free itemsets have been formalized
independently as the so-called key patterns [5]. Furthermore, the concept of free
itemset formalizes the concept of generator [65] in an extended framework since
free itemsets are a special case of $\delta$-free itemsets [15,16].

Definition 18 ($\delta$-free Itemsets and Constraint $C_{\delta\text{-}free}$). Let $\delta$ be an
integer and S an itemset. S is $\delta$-free if no association rule with at
most $\delta$ exceptions holds between its subsets. $\delta$-free sets satisfy the constraint
$C_{\delta\text{-}free}(S) \equiv (\forall S' \subset S)\ S \not\subseteq closure_\delta(S')$.

Example 7. In the database of Figure 1, the 1-free sets are ∅, A, B, and D.
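A direct, non-optimized reading of this definition is sketched below. Two assumptions are made: the database `DB` is a toy example that is not claimed to be Figure 1, and only rules with a single item in the head are tested, which suffices because a rule $X \Rightarrow Y$ with at most $\delta$ exceptions implies that each rule $X \Rightarrow a$, $a \in Y$, has at most $\delta$ exceptions as well.

```python
from itertools import combinations

# Assumed toy database (illustration only, Figure 1 may differ).
DB = [frozenset(t) for t in ("ABCD", "ABCD", "ABC", "AC", "BC", "C")]

def support(s, db=DB):
    return sum(s <= t for t in db)

def exceptions(body, item, db=DB):
    """Number of transactions that contain `body` but not `item`."""
    return support(body, db) - support(body | {item}, db)

def is_delta_free(itemset, delta, db=DB):
    """True iff no rule X => a built from the items of `itemset`
    has at most `delta` exceptions."""
    s = frozenset(itemset)
    for a in s:
        rest = sorted(s - {a})
        for k in range(len(rest) + 1):
            for body in combinations(rest, k):
                if exceptions(frozenset(body), a, db) <= delta:
                    return False
    return True

print([x for x in ("", "A", "B", "C", "D", "AB", "AD") if is_delta_free(x, 1)])
```

With `delta = 0` the function tests ordinary freeness, so AB is 0-free on this database while it is not 1-free.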

2.4 Example of Inductive Queries 

Now, it is interesting to consider boolean combinations of primitive constraints. 
Notice that in this section, we consider neither the problem of query evaluation 
nor the availability of concrete query languages. 

The Standard Association Rule Mining Task. Mining the frequent itemsets
means the computation of $SAT_{C_{minfreq}}$ for a given frequency threshold. The
standard association rule mining problem introduced in [1] is to find all the
association rules that verify the minimal frequency and minimal confidence
constraints for some user-defined thresholds. In other terms, we are looking for
each pattern $\phi$ (rules) such that $C_{minfreq}(\phi) \wedge C_{minconf}(\phi)$ is true. Filtering rules
according to syntactical criteria can also be expressed by further constraints.
Example 8. Provided the dataset of Figure 1 and the constraints from Examples 3
and 4, $SAT_{C_{minfreq} \wedge C_{size} \wedge C_{miss}}$ = {A, C, AC} is returned when the query
specifies that the desired itemsets must be 0.6-frequent, with size less than
3 and without the attribute B. It is straightforward to consider queries on
rules. Let us consider an example where the user/analyst wants all the frequent
and valid association rules but also quite restricted rules with high confidence
(but without any minimal frequency constraint). Such a query could be based,
e.g., on the constraint $(C_{minfreq}(\phi) \wedge C_{minconf}(\phi)) \vee (C_s(\phi) \wedge C_{minconf}(\phi))$ where
$C_s(X \Rightarrow Y) \equiv |X| = |Y| = 1 \wedge Y = A$.
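Boolean combinations of primitive constraints can be illustrated by treating each constraint as a predicate and evaluating the query naively over the whole search space. The sketch below is not a query evaluation strategy, only a specification aid; the database `DB` and the combinators `AND`, `OR`, `NOT` and `sat` are assumptions introduced for the example.

```python
from itertools import combinations

# Assumed toy database (illustration only, Figure 1 may differ).
DB = [frozenset(t) for t in ("ABCD", "ABCD", "ABC", "AC", "BC", "C")]
ITEMS = "ABCD"

def freq(s):
    return sum(s <= t for t in DB) / len(DB)

# Combinators that build a new constraint from primitive ones.
def AND(*cs): return lambda s: all(c(s) for c in cs)
def OR(*cs):  return lambda s: any(c(s) for c in cs)
def NOT(c):   return lambda s: not c(s)

c_minfreq = lambda s: freq(s) >= 0.6
c_size = lambda s: len(s) <= 2
c_miss = lambda s: "B" not in s

def sat(constraint):
    """Naive evaluation: test every itemset of the search space."""
    space = (frozenset(c) for k in range(len(ITEMS) + 1)
             for c in combinations(ITEMS, k))
    return sorted("".join(sorted(s)) for s in space if constraint(s))

query = AND(c_minfreq, c_size, c_miss)
print(sat(query))
```

In this sketch the empty itemset trivially satisfies the three predicates, so it is listed next to A, C and AC.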

Mining Discriminant Patterns. An interesting application of frequency con- 
straints concerns the search for patterns that are frequent in one data set and 
infrequent in another one. This has been studied in [33] as the emerging pat- 
tern mining task. More recently, it has been studied within an inductive logic 
programming setting in [32] and applied to molecular fragment discovery. 

Assume a transactional database in which one of the items defines a class
value (e.g., item A is present when the transaction has the class value
“interesting” and absent when the transaction has the class value “irrelevant”). It is
then possible to split the database r into two databases, the one of interesting
transactions (say $r_1$) and the one of irrelevant transactions (say $r_2$). Now, a useful
mining task concerns the computation of every itemset S such that $C_{minfreq}(S, r_1) \wedge
C_{maxfreq}(S, r_2)$. Indeed, these itemsets are supported by interesting transactions
and not supported by irrelevant ones. Thresholds can be assigned thanks to a
statistical analysis and such patterns can be used for predictive mining tasks.
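The split of the database and the conjunction of a minimal and a maximal frequency constraint can be sketched as follows. Everything here is hypothetical: the labelled transactions, the class item named `X`, the thresholds and the function `discriminant` are assumptions made for the illustration.

```python
from itertools import combinations

# Hypothetical labelled transactions: item "X" marks the "interesting" class.
DB = [frozenset(t) for t in ("ABX", "ABCX", "ABX", "BC", "C", "AC")]
CLASS_ITEM = "X"

# r1: interesting transactions (class item removed), r2: irrelevant ones.
r1 = [t - {CLASS_ITEM} for t in DB if CLASS_ITEM in t]
r2 = [t for t in DB if CLASS_ITEM not in t]

def freq(s, db):
    return sum(s <= t for t in db) / len(db)

def discriminant(items, gamma_min, gamma_max):
    """Non-empty itemsets that are frequent in r1 and infrequent in r2."""
    found = []
    for k in range(1, len(items) + 1):
        for c in combinations(sorted(items), k):
            s = frozenset(c)
            if freq(s, r1) >= gamma_min and freq(s, r2) <= gamma_max:
                found.append("".join(c))
    return found

print(discriminant("ABC", gamma_min=0.6, gamma_max=0.34))
```

The enumeration is exhaustive for readability; the minimal frequency part is anti-monotonic and could be used for pruning.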

Mining Association Rules with Negations. Let $Items^+$ = {A, B, ...} be a
finite set of symbols called the positive items and $Items^-$ a set of the same
cardinality as $Items^+$ whose elements are denoted $\overline{A}, \overline{B}, \ldots$ and called the negative items.
Given a transaction database r over $Items^+$, let us define a complemented
transaction database over $Items = Items^+ \cup Items^-$ as follows: for a given transaction
$t \in r$, we add to t the negative items corresponding to the positive items not present
in t. Generalized itemsets are subsets of Items and can contain positive and
negative items. In other terms, we want to have a symmetrical impact for the
presence or the absence of items in transactions [14]. It leads to extremely dense
transactional databases, i.e., extremely difficult extraction processes.
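The construction of the complemented database can be sketched as follows. The encoding of a negative item as the positive item followed by `~`, the four sample transactions and the function name `complement` are assumptions made for the illustration.

```python
ITEMS_POS = "ABCD"   # assumed set of positive items

def complement(transaction, items=ITEMS_POS):
    """Add the negative item 'x~' for every positive item x absent from the transaction."""
    t = set(transaction)
    return frozenset(t | {i + "~" for i in items if i not in t})

DB = ["ABCD", "AC", "BC", "C"]   # hypothetical transactions
complemented = [complement(t) for t in DB]

for t in complemented:
    print(sorted(t))
```

Every complemented transaction contains exactly as many items as there are positive items, whatever the original transaction was, which is one way to see why the resulting data are so dense.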

In [13], the authors studied the extraction of frequent itemsets ($C_{minfreq}$) that
do not involve only negative items ($C_{alpp}$). $C_{alpp}(S)$ is true when S involves at
least p positive items. Also, this constraint has been relaxed into $C_{alppoam1n} \equiv
C_{alpp} \vee C_{am1n}$ (at least p positive attributes or at most 1 negative attribute). On
different real data sets, it has been possible to get interesting results when it was
combined with condensed representations (see Section 4).

Mining Condensed Representation of Frequent Itemsets. Condensed
representation is a general concept (see, e.g., [54]) that can be extremely useful
for the concise representation of the collection of frequent itemsets and their
frequencies. In Section 3, we define more precisely this approach. Let us
notice at that stage that several algorithms exist to compute various condensed
representations of the frequent itemsets: Close [65], Closet [69], Charm [75],
Min-Ex [12,15,16], or Pascal [5]. These algorithms compute different condensed
representations: the frequent closed itemsets (Close, Closet, Charm), the
frequent free itemsets (Min-Ex, Pascal), or the frequent $\delta$-free itemsets for
Min-Ex. From an abstract point of view, these algorithms are respectively looking
for itemsets that satisfy $C_{minfreq} \wedge C_{close}$, $C_{minfreq} \wedge C_{free}$, and $C_{minfreq} \wedge C_{\delta\text{-}free}$.

Furthermore, it can be interesting to provide association rules whose
components satisfy some constraints based on closures. For instance, association rules
that are based on free itemsets on their left-hand side ($C_{free}(BODY)$) and their
closures on the right-hand side ($C_{close}(BODY \cup HEAD)$) are of a particular
interest: they constitute a kind of cover for the whole collection of frequent and
valid association rules [4,8]. Notice also that the use of classification rules based
on $\delta$-free sets ($C_{\delta\text{-}free}$) for the body and a class value in the head has been
studied in [17,28].

Postprocessing Queries. Post-processing queries can be understood as queries 
on materialized collections of itemsets or association rules: the user selects the 
itemsets or the rules that fulfill some new criteria (while these itemsets or rules 
have been mined, e.g., they are all frequent and valid). However, from the spec- 
ification point of view, they are not different from data mining queries even 
though the evaluation does not need an extraction phase and can be performed 
on materialized collections of itemsets or rules. 

One important post-processing use is to cross over the patterns and the data,
e.g., when looking at transactions that are exceptions to some rules. For instance,
given an association rule $A \Rightarrow B$, one wants all the transactions t from r (say
the transactional database $r_1$) for which $e_2(A \Rightarrow B, t)$ is true. Notice that $r \setminus r_1$ is
the collection of exceptions to the rule. A rule mining query language like MSQL
[43] offers a few built-in primitives for rule post-processing, including primitives
that cross over the rules and the transactions (see also [10] in this volume for
examples of post-processing queries).

3 A Selection on Some Open Problems 

We consider several important open problems that are related to itemset and 
rule queries. 

3.1 Tractability of Frequent Itemset and Association Rule Mining

Computing the result of the classical association rule mining problem is generally
done in two steps [2]: first the computation of all the frequent itemsets and their
frequency and then the computation of every valid association rule that can be
made from disjoint subsets of each frequent itemset. This second step is far less
expensive than the first one because no access to the database is needed: only the
collection of the frequent itemsets and their frequencies is needed. Furthermore,
the frequent itemsets can be used for many other applications, far beyond the
classical association rule mining task. These include, among others, clustering (see, e.g.,
[61]), classification (see, e.g., [53,28]), and generalized rule mining (see, e.g., [54,14]).
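The second step can be sketched without any reference to the transactions: the confidence of a rule is the ratio of two frequencies that are both available in the result of the first step. In the sketch below, the table `FREQUENT` is a hypothetical output of the first step and `valid_rules` is an assumed helper name.

```python
from itertools import combinations
from fractions import Fraction

# Hypothetical output of step 1: frequent itemsets with their frequencies.
FREQUENT = {frozenset(k): Fraction(n, 6) for k, n in
            [("", 6), ("A", 4), ("B", 4), ("C", 6), ("AC", 4), ("BC", 4)]}

def valid_rules(frequent, theta):
    """Step 2: derive the rules body => head from the frequency table only."""
    rules = []
    for s, f in frequent.items():
        for k in range(len(s)):            # proper subsets only: the head is non-empty
            for body in combinations(sorted(s), k):
                body = frozenset(body)
                # Every subset of a frequent itemset is frequent, hence in the table.
                confidence = f / frequent[body]
                if confidence >= theta:
                    rules.append(("".join(sorted(body)),
                                  "".join(sorted(s - body)), confidence))
    return rules

for body, head, c in valid_rules(FREQUENT, Fraction(7, 10)):
    print(f"{body or '{}'} => {head}  (confidence {c})")
```

The lookup `frequent[body]` never fails because the collection of frequent itemsets is closed under subsets.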
Computing the frequent itemsets is an important data mining task that has
been studied by many researchers since 1994. The famous Apriori algorithm
[2] has inspired much research, and efficient implementations of Apriori-like
algorithms can be used provided that the collection of the frequent itemsets is
not too large. In other terms, for the desired frequency threshold, the size of the
maximal frequent itemsets must not be too large (around 15). Indeed, this kind of
algorithm must count the frequencies of at least every frequent itemset, and useful
tasks, according to the user/analyst, become intractable as soon as the size of
$SAT_{C_{minfreq}}$ is too large for the chosen frequency threshold. It is the case of dense
and correlated datasets and many real datasets fall in this category. In Section
4, we consider solutions thanks to the design of condensed representations for
frequent itemsets.

The efficiency of the extraction of the answer to an itemset query relies on the 
possibility to use constraints during the itemset computation. A classical result 
is that effective safe pruning can be achieved when considering anti-monotonic 
constraints [55,64], e.g., the minimal frequency constraint. It relies on the fact 
that if an itemset violates an anti-monotonic constraint then all its supersets 
violate it as well and therefore this itemset and its supersets can be pruned and 
thus not considered for further evaluation. 

Definition 19 (Anti-monotonicity). An anti-monotonic constraint is a
constraint C such that for all itemsets S, S': $(S' \subseteq S \wedge C(S)) \Rightarrow C(S')$.



Example 9. Examples of anti-monotonic constraints are: $C_{minfreq}(S)$, $C(S) \equiv A \notin
S$, $C(S) \equiv S \subseteq \{A,B,C\}$, $C(S) \equiv S \cap \{A,B,C\} = \emptyset$, $C_{free}(S)$, $C_{am1n}(S)$,
$C_{\delta\text{-}free}(S)$, $C(S) \equiv S.price > 50$ and $C(S) \equiv Sum(S.price) < 500$. The two last
constraints mean respectively that the price of all items must be greater than fifty
and that the sum of the prices of the items must be lower than five hundred.

Notice that the conjunction or disjunction of anti-monotonic constraints is 
anti-monotonic. 
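The safe pruning allowed by anti-monotonicity can be sketched by a levelwise enumeration in the spirit of Apriori-like algorithms. This is a minimal illustration, not the algorithm of [2]: the database `DB` is an assumed toy example, and `levelwise`, `c_minfreq` and the list `evaluated` are names introduced here.

```python
from itertools import combinations

# Assumed toy database (illustration only).
DB = [frozenset(t) for t in ("ABCD", "ABCD", "ABC", "AC", "BC", "C")]

def levelwise(items, c_am):
    """Enumerate the itemsets that satisfy the anti-monotonic constraint c_am."""
    if not c_am(frozenset()):
        return []                      # if the empty set fails, every itemset fails
    solutions = [frozenset()]
    level = [s for s in (frozenset([i]) for i in sorted(items)) if c_am(s)]
    while level:
        solutions.extend(level)
        kept, k = set(level), len(level[0]) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Safe pruning: evaluate a candidate only if all its (k-1)-subsets passed.
        level = [c for c in candidates
                 if all(frozenset(sub) in kept for sub in combinations(c, k - 1))
                 and c_am(c)]
    return solutions

evaluated = []                         # records which itemsets were actually tested

def c_minfreq(s, gamma=0.6):
    evaluated.append(s)
    return sum(s <= t for t in DB) / len(DB) >= gamma

result = levelwise("ABCD", c_minfreq)
print(sorted("".join(sorted(s)) for s in result), len(evaluated))
```

On this example the constraint is evaluated on 8 of the 16 possible itemsets; ABC is never counted because its subset AB has already been found infrequent.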

Even though the anti-monotonic constraints, when used actively, can dras- 
tically reduce the search space, it is not possible to ensure the tractability of 
an inductive query evaluation. In that case, the user/analyst has to use more 
selective constraints, e.g., a higher frequency threshold. A side-effect can
be that the extracted patterns are no longer interesting enough, e.g., they are so
frequent that they correspond to trivial statements.

Furthermore, itemset queries do not involve only anti-monotonic constraints.
For instance, $C_{close}$ is not anti-monotonic. Sometimes, it is possible to
post-process the collection of itemsets that satisfy the anti-monotonic part of the
selection predicate to check the remaining constraints afterwards.

3.2 Tractability of Constraint-Based Itemset Mining 

Pushing constraints is useful for anti-monotonic ones. Other constraints can be
pushed like the monotonic constraints or the succinct constraints [64].
Definition 20 (Monotonicity). A monotonic constraint is a constraint C such
that for all itemsets S, S': $(S \subseteq S' \wedge C(S)) \Rightarrow C(S')$.

The negation of an anti-monotonic constraint is a monotonic constraint and 
the conjunction or disjunction of monotonic constraints is still monotonic. 

Example 10. $C(S) \equiv \{A,B,C,D\} \subset S$, $C_{alpp}(S)$, $C(S) \equiv Sum(S.price) > 100$
(the sum of the prices of items from S is greater than 100) and $C(S) \equiv S \cap
\{A,B,C\} \ne \emptyset$ are examples of monotonic constraints.

Indeed, monotonic constraints can also be used to improve the efficiency of
itemset extraction: the candidate generation phase can be optimized so that
candidates that do not satisfy the monotonic constraint are not considered (see, e.g.,
[18]).

Succinct constraints, introduced in [64], are syntactic
constraints that can be put under the form of a conjunction of monotonic and
anti-monotonic constraints. Clearly, it is possible to use such a property for
the optimization of the constraint-based extraction (optimization of candidate
generation and pruning).

Pushing non anti-monotonic constraints sometimes increases the computation
times since it prevents effective pruning based on anti-monotonic constraints
[73,18,34]. For instance, as described in [13], experiments have shown that it
was necessary to relax the monotonic constraint $C_{alpp}$ (“pushing” it gave
rise to a lack of pruning) into $C_{alppoam1n} \equiv C_{alpp} \vee C_{am1n}$ where $C_{am1n}$ is
anti-monotonic. The identification of a good strategy for pushing constraints requires
a priori knowledge of constraint selectivity. However, this is in general
not available at extraction time. Designing adaptive strategies for pushing
constraints during itemset mining is still an open problem. Notice however that
some algorithms have already been proposed for specific strategies on itemset
mining under conjunctions of constraints that are monotonic and anti-monotonic
[64,18]. This has been explored further within the cInQ project (see Section 4).

Notice also that the constraints defined by a user on the desired association 
rules have to be transformed into suitable itemset constraints. So far, this has 
to be done by an ad-hoc processing and designing semi-automatic strategies for 
that goal is still an open problem. 

3.3 Interactive Itemset Mining 

From the user point of view, pattern discovery is an interactive and iterative
process. The user defines a query by specifying various constraints on the patterns
he/she wants. When a discovery process starts, it is difficult to figure out the
collection of constraints that leads to an interesting result. The result of a data
mining query is often unpredictable and the user has to produce sequences of
queries until he/she gets an actionable collection of patterns. So, we have not
only to optimize single inductive query evaluations but also the evaluation of
sequences of queries. This has been studied for itemsets and association rules in,
e.g., [37,3,36,62].
One classical challenge is the design of incremental algorithms for computing
theories (e.g., itemsets that satisfy a complex constraint) or extended theories
(e.g., itemsets and their frequencies) when the data changes. More generally,
reusing previously computed theories to answer new inductive queries more
efficiently is important. It means that results about equivalence and containment
of inductive queries are needed. Furthermore, the concept of dominance has
emerged [3]. In that case, only data scans are needed to update the value of the
evaluation functions.

However, here again, a trade-off has to be found between the optimization of 
single queries by the best strategy for pushing its associated constraints and the 
optimization of the whole sequence. Indeed, the more we push the constraint and 
materialize only the constrained collection, the less it will be possible to reuse it 
for further evaluations [37,36]. In Section 4, we refer to recent advances in that 
area. 

3.4 Concrete Query Languages 

There are no dedicated query languages for itemsets but several proposals exist for
association rule mining, e.g., MSQL [43], DMQL [38], and MINE RULE [59]. Among
them, the MINE RULE query language is one of the few proposals for which a
formal operational semantics has been published [59]. Ideally, these query languages
must support not only the selection of the mining context and its pre-processing
(e.g., sampling, selection and grouping, discretization), but also the specification of a
mining task (i.e., the expression of various constraints on the desired rules) and
the post-processing of these rules (e.g., the support of subjective interestingness
evaluation, redundancy elimination and grouping strategies, etc.).

A comparative study of the available concrete query languages is published
in this volume [10]. It illustrates that we still lack an “ideal” query
language for supporting KDD processes based on association rules. From our
perspective, we are still looking for a good set of primitives that might be supported
by such languages. Furthermore, a language like MINE RULE enables the use of
various kinds of constraints on the desired rules (i.e., the relevant constraints on the
itemsets are not explicit) and optimizing the evaluation of the mining queries (or
sequences of queries) still needs further research. This challenge is also considered
within the cInQ project.



4 Elements of Solution 

We now provide pointers to elements of solution that have been studied by the 
cInQ partners over the last 18 months. They concern each of the issues we have
been discussing in Section 3. Even though the consortium has not studied only 
itemsets but also molecular fragments, sequential patterns and strings, the main 
results can be illustrated on itemsets. 

Let us recall that, again, we do not claim that this section considers all 
the solutions studied so far. Many other projects and/or research groups are 
interested in the same open problems and study other solutions. This is the 
typical case for depth-first algorithms like [39], which open new possibilities for
efficient constraint-based itemset computation [67]. 

Let us first formalize that inductive queries that return itemsets might also 
provide the results of the frequencies for further use. 

Definition 21 (Itemset Query). An itemset query is a pair (C, r) where r is
a transactional database and C is an itemset constraint. The result of a query
Q = (C, r) is defined as the set $Res(Q) = \{(S, \mathcal{F}(S)) \mid S \in SAT_C\}$.

There are two main approaches for the approximation of Res(Q):

— The result is $Approx(Q) = \{(S, \mathcal{F}(S)) \mid S \in SAT_{C'}\}$ where $C' \ne C$. In that
case, Approx(Q) and Res(Q) are different. When C is more selective in r
than C', we have $Res(Q) \subseteq Approx(Q)$. A post-processing on Approx(Q)
might be used to eliminate itemsets that do not verify C. When C is less
selective than C' then Approx(Q) is said to be incomplete.

— The result is $Approx(Q) = \{(S, \mathcal{F}'(S)) \mid S \in SAT_C\}$ where $\mathcal{F}'$ provides an
approximation of the frequency of each itemset in Approx(Q).

Indeed, the two situations can occur simultaneously. A typical
case is the use of sampling on the database: one can sample the database ($r' \subset r$ is
the mining context) and compute Res(Q) not in r but in r'. In that case, both the
collection of the frequent itemsets and their frequencies are approximated. Notice
however that clever strategies can be used to avoid, in practice, an incomplete
answer [74].

A classical result is that it is possible to represent the collection of the fre- 
quent itemsets by its maximal elements, the so-called positive border in [55] or 
the S set in the machine learning terminology [60]. Also, it is possible to compute
these maximal itemsets and their frequencies without computing the frequency
of every frequent itemset (see, e.g., [6]). This can be generalized to any
anti-monotonic constraint: the collection of the most specific sentences Approx(Q)
(e.g., the maximal itemsets) is a compact representation of Res(Q) from which 
(a) it is easy to derive the exact collections of patterns (every sentence that 
is more general belongs to the solution, e.g., every subset of the maximal fre- 
quent itemsets) but, (b) the evaluation functions (e.g., the frequency) are only 
approximated. Thus, the maximal itemsets can be considered as an example of 
an approximative condensed representation of the frequent itemsets. 
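The two properties (a) and (b) can be illustrated on a small collection. In the sketch below, the list `frequent` is a hypothetical collection of frequent itemsets and the helpers `maximal` and `is_frequent` are assumed names; only membership is recovered from the border, not the frequencies.

```python
def maximal(itemsets):
    """Keep the itemsets that have no proper superset in the collection."""
    sets = [frozenset(s) for s in itemsets]
    return [s for s in sets if not any(s < t for t in sets)]

def is_frequent(s, border):
    """Membership can be decided from the border alone; the frequency cannot."""
    return any(frozenset(s) <= m for m in border)

# Hypothetical collection of frequent itemsets ("" stands for the empty set).
frequent = ["", "A", "B", "C", "AC", "BC"]
border = maximal(frequent)

print([sorted(m) for m in border], is_frequent("A", border), is_frequent("AB", border))
```

Here six itemsets are represented by two maximal ones, at the price of losing the exact frequencies of the non-maximal itemsets.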

First, we have been studying algorithms that compute itemsets under more 
general constraints, e.g., conjunctions of anti-monotonic and monotonic con- 
straints. Next, we have designed other approximative condensed representations 
and exact ones as well. 



4.1 Algorithms for Constraint-Based Mining 

cInQ partners have studied the extraction of itemsets (and rather similar
pattern domains like strings, sequences or molecular fragments) under a conjunction
of monotonic and anti-monotonic constraints. Notice also that since disjunctions
of anti-monotonic (resp. monotonic) constraints are anti-monotonic (resp.
monotonic), this makes it possible to consider rather general forms of inductive queries.

[46] provides a generic algorithm that generalizes previous work for
constraint-based itemset mining in a levelwise approach (e.g., [73,64]). The idea is that,
given a conjunction of an anti-monotonic constraint and a monotonic constraint
($C_{am} \wedge C_m$), it is possible to start a levelwise search from the minimal (w.r.t.
set inclusion) itemsets that satisfy $C_m$ and to complete this collection until the
maximal itemsets that satisfy the $C_{am}$ constraint are reached. Such a levelwise
algorithm provides the complete collection Res(Q) when Q can be expressed by
means of a conjunction $C_{am} \wedge C_m$. [46] introduces strategies (e.g., for computing
the minimal itemsets that satisfy $C_m$ by using the duality between monotonic
and anti-monotonic constraints). Details are available in [44].

Mining itemsets under $C_{am} \wedge C_m$ can also be considered as a special case of
the general algorithm introduced in [30]. This paper considers queries that are
boolean expressions over monotonic and anti-monotonic primitives on a single
pattern variable $\phi$. This is a quite general form of inductive query and it is
shown that the solution space corresponds to the union of various version spaces
[60,57,40,41]. Because each version space can be represented in a concise way
using its border sets S and G, [30] shows that the solution space of a query can be
represented using the border sets of several version spaces. When a query enforces
a conjunction $C_{am} \wedge C_m$, [30] proposes to compute $S(C_{am} \wedge C_m)$ as $\{s \in S(C_{am}) \mid
\exists g \in G(C_m) : g \subseteq s\}$ and dually for $G(C_{am} \wedge C_m)$. Thus, the borders for $C_{am} \wedge C_m$
can be computed from $S(C_{am})$ and from $G(C_m)$ as usual for the classical version
space approach. Sets such as $S(C_{am})$ can be computed using classical algorithms
such as the levelwise algorithm [55] and the dual set $G(C_m)$ can be computed
using the dual algorithms [32]. These border sets are an approximative condensed
representation of the solution. For the MolFea specific inductive database, it
has been proved quite effective for molecular fragment finding [49,32,50,48].
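For itemsets, where the generality relation is set inclusion, the combination of the two borders can be sketched as follows. The borders used in the example are hypothetical, the dual formula for the G border is our reading of "dually" and is therefore an assumption, and the helper names `s_border`, `g_border` and `in_version_space` are introduced here.

```python
def s_border(s_am, g_m):
    """S(C_am and C_m): maximal solutions of C_am that lie above some element of G(C_m)."""
    return [s for s in s_am if any(g <= s for g in g_m)]

def g_border(s_am, g_m):
    """Assumed dual: minimal solutions of C_m that lie below some element of S(C_am)."""
    return [g for g in g_m if any(g <= s for s in s_am)]

def in_version_space(x, S, G):
    """x is a solution iff it lies between an element of G and an element of S."""
    x = frozenset(x)
    return any(g <= x for g in G) and any(x <= s for s in S)

# Hypothetical borders: S(C_am) for a minimal frequency constraint,
# G(C_m) for the monotonic constraint "S contains A or D".
S_am = [frozenset("AC"), frozenset("BC")]
G_m = [frozenset("A"), frozenset("D")]

S = s_border(S_am, G_m)
G = g_border(S_am, G_m)
print([sorted(s) for s in S], [sorted(g) for g in G])
```

With these borders, A and AC belong to the version space whereas C and BC do not, since no element of the G border is included in them.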

Sequential pattern mining has been studied as well. Notice that molecular 
fragments can be considered as a special case of sequences or strings. [27]
studies sequential pattern mining under a specific conjunction of constraints that asks
for minimal frequency and similarity w.r.t. a reference pattern. In this work,
the main contribution has been to relax the similarity constraint into an anti- 
monotonic one to improve pruning efficiency. It is an application of the frame- 
work for convertible constraints [68]. Also, logical sequence mining under con- 
straints has been studied, in a restricted framework [56] (regular expressions on 
the sequence of predicate symbols and minimal frequency) and in a more general 
setting [52] (conjunction of anti-monotonic and monotonic constraints). 

4.2 Condensed Representations for Frequent Itemsets 

cInQ partners have studied the condensed approximations in two complemen- 
tary directions: the use of border sets in a very general setting (i.e., version 
spaces) but also several condensed representations of the frequent itemsets. 

— Border sets represent the maximally general and/or maximally specific
solutions to an inductive query. They can be used to bound the set of all
solutions [30]. This can be used in many different pattern domains provided
that the search space is structured by a specialization relation and that the
solution space is a version space. They are useful in case only membership
of the solution set is important.

— Closed sets [65,12], $\delta$-free sets [15,16], and disjoint-free sets [25] are condensed
representations that have been designed as $\epsilon$-adequate representations w.r.t.
frequency queries, i.e., representations from which the frequency of any itemset
can be inferred or approximated within a bounded error.

The collection of the $\gamma$-frequent itemsets and their frequencies can be considered
as a $\gamma/2$-adequate representation w.r.t. frequency queries [12]. It means
that the error on the inference of a frequency for a given itemset is bounded by
$\gamma/2$. Indeed, the frequency of an infrequent itemset can be set to $\gamma/2$ while the
frequency of a frequent one is known exactly. Given a set S of pairs $(X, \mathcal{F}(X))$,
e.g., the collection of all the frequent itemsets and their frequencies, we are
interested in condensed representations of S that are subsets of S with two properties:
(1) they are much smaller than S and faster to compute, and (2) the whole
set S can be generated from the condensed representation with no access to the
database, i.e., efficiently.

We have introduced in Section 2.3 the concepts of closed sets, free sets and
$\delta$-free sets. Disjoint-free itemsets are a generalization of free itemsets [25]. They are
all condensed representations of the frequent itemsets that are exact representations
(no loss of information w.r.t. the frequent itemsets and their frequencies),
except for the $\delta$-free itemsets (with $\delta \ne 0$) which is an approximative one. Let
us now give the principle of regeneration from the frequent closed itemsets:

— Given an itemset S and the set of frequent closed itemsets,

• If S is not included in a frequent closed itemset then S is not frequent.

• Else S is frequent and $\mathcal{F}(S) = Max\{\mathcal{F}(X),\ S \subseteq X \wedge C_{close}(X)\}$.

As a result, $\gamma$-frequent closed itemsets are, like the $\gamma$-frequent itemsets, a
$\gamma/2$-adequate representation for frequency queries.
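The regeneration principle translates almost literally into code. In the sketch below, the table `CLOSED` is a hypothetical collection of frequent closed itemsets with their frequencies (over six transactions), and `regenerate` is an assumed helper name; no transaction is accessed.

```python
from fractions import Fraction

# Hypothetical frequent closed itemsets and their frequencies.
CLOSED = {frozenset(k): Fraction(n, 6) for k, n in
          [("C", 6), ("AC", 4), ("BC", 4), ("ABC", 3), ("ABCD", 2)]}

def regenerate(itemset, closed=CLOSED):
    """Frequency of `itemset`, or None when it is included in no frequent closed set."""
    s = frozenset(itemset)
    candidates = [f for x, f in closed.items() if s <= x]
    return max(candidates) if candidates else None

print(regenerate("AB"), regenerate("D"), regenerate("E"))
```

The maximum is reached on the smallest closed superset: for AB, the closed supersets are ABC and ABCD, and the frequency returned is the one of ABC.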

Example 11. In the database of Figure 1, if the frequency threshold is 0.2, every 
itemset is frequent and the frequent closed sets are C, AC, BC, ABC, and ABCD. 
$\mathcal{F}(AB) = \mathcal{F}(ABC)$ since ABC is the smallest closed superset of AB.

The regeneration from $\delta$-free itemsets is provided later. By construction,
$|SAT_{C_{close}}| \le |SAT_{C_{free}}|$ and $|SAT_{C_{\delta\text{-}free}}| \le |SAT_{C_{free}}|$ when $\delta > 0$. Also, in
practice, the size of these condensed representations is several orders of
magnitude lower than the size of the frequent itemsets for dense data sets [24].

Several algorithms exist to compute various condensed representations of
frequent itemsets [65,69,75,12,15,5,25]. These algorithms compute different
condensed representations: the frequent closed itemsets (Close, Closet, Charm),
the frequent free itemsets (Min-Ex, Pascal), the frequent $\delta$-free itemsets
(Min-Ex), or the disjoint-free itemsets (H/VLin-Ex). Tractable extractions from dense
and highly-correlated data have become possible for frequency thresholds on
which previous algorithms are intractable.
Representations based on $\delta$-free itemsets are quite interesting when it is not
possible to mine the closed sets or even the disjoint-free sets, i.e., when the
computation is intractable given the user-defined frequency threshold. Indeed,
algorithms like Close [65] or Pascal [5] or H/VLin-Ex [25] use special kinds of
logical rules to prune candidate itemsets because their frequencies can be inferred
from the frequencies of others. However, to be efficient, these algorithms need
that such logical rules hold in the data.

Let us now consider the $\delta$-free itemsets and how they can be used to answer
frequency queries. The output of the Min-Ex algorithm [16] is formally given by
the three following sets: $FF(r,\gamma,\delta)$ is the set of the $\gamma$-frequent $\delta$-free itemsets,
$IF(r,\gamma,\delta)$ is the set of the minimal (w.r.t. the set inclusion) infrequent $\delta$-free
itemsets (i.e., the infrequent $\delta$-free itemsets all of whose subsets are $\gamma$-frequent).
$FN(r,\gamma,\delta)$ is the set of the minimal $\gamma$-frequent non-$\delta$-free itemsets (i.e., the
$\gamma$-frequent non-$\delta$-free itemsets all of whose subsets are $\delta$-free). The two pairs (FF, IF)
and (FF, FN) are two condensed representations based on $\delta$-free itemsets.

It is possible to compute an approximation of the frequency of an itemset 
using one of these two condensed representations: 

— Let S be an itemset. If there exists $X \in IF(r,\gamma,\delta)$ such that $X \subseteq S$ then S is
infrequent. If $S \notin FF(r,\gamma,\delta)$ and there does not exist $X \in FN(r,\gamma,\delta)$ such
that $X \subseteq S$ then S is infrequent. In these two cases, the frequency of S can
be approximated by $\gamma/2$. Else, let F be the $\delta$-free itemset such that $\mathcal{F}(F) =
Min\{\mathcal{F}(X),\ X \subseteq S \text{ and } X \text{ is } \delta\text{-free}\}$. Assuming that $n_S = |support(S)|$ and
$n_F = |support(F)|$, then $n_F \ge n_S \ge n_F - \delta(|S| - |F|)$, or, dividing this by
n, the number of rows in r, $\mathcal{F}(F) \ge \mathcal{F}(S) \ge \mathcal{F}(F) - \frac{\delta}{n}(|S| - |F|)$.
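The bound above can be turned into a small numerical check. The function name `frequency_interval` and the figures used in the call (a database of 10,000 rows, $\delta = 10$, an itemset of size 5 whose least frequent $\delta$-free subset has size 2 and frequency 0.3) are assumptions chosen for the illustration.

```python
from fractions import Fraction

def frequency_interval(freq_f, size_s, size_f, delta, n):
    """Bounds on the frequency of S derived from its least frequent delta-free subset F:
    F(F) - (delta / n) * (|S| - |F|) <= F(S) <= F(F)."""
    lower = freq_f - Fraction(delta, n) * (size_s - size_f)
    return max(lower, Fraction(0)), freq_f

low, high = frequency_interval(Fraction(3, 10), size_s=5, size_f=2, delta=10, n=10_000)
print(low, high, high - low)
```

With these assumed figures the width of the interval is 3/1000, i.e., the frequency of S is known up to 0.3 percentage points without any access to the data.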

It is thus possible to regenerate an approximation of the answer to a frequent
itemset query from one of the condensed representations (FF, IF) or (FF, FN).
Typical $\delta$ values range from zero to a few hundreds. With a database size of
several tens of thousands of rows, the error made is below a few percent [16]. If
$\delta = 0$, then the two condensed representations make it possible to regenerate exactly the
answer to a frequent itemset query.

This line of work has inspired other researchers. For instance, [26] proposed 
a new exact condensed representation of the frequent itemsets that generalizes 
the previous ones. It is, to the best of our knowledge, the most interesting exact 
representation identified so far. In [66], new approximative condensed representa- 
tions are proposed that are built from the maximal frequent itemsets for various 
frequency values. 

The condensed representations can also be used for constraint-based mining
of itemsets and the optimization of sequences of queries. In [46,45], constraint-based
mining under conjunctions of anti-monotonic and monotonic constraints
is combined with condensed representations. Some technical problems had to
be solved, which has led to the concept of contextual $\delta$-free itemsets w.r.t. a
monotonic constraint [19]. The use of condensed representations is not limited
to the optimization of single queries: [47] describes the use of a cache that contains
free itemsets to optimize the evaluation of sequences of itemset queries.
Notice that other researchers also consider the optimization of sequences based
on condensed representations like the free itemsets [35].

4.3 Optimizing Association Rule Mining Queries 

MINE RULE [59] has been designed by researchers who belong to the cInQ con- 
sortium. This is one of the query languages dedicated to association rule mining 
[9,10]. New extensions to the MINE RULE operator have been studied [11,58]. 
Two important and challenging new notions include: pattern views and relations 
among inductive queries. Both of these notions have also been included (and were
actually inspired by MINE RULE) in the logical inductive database theory [29].
Pattern views intensionally specify a set of patterns using an inductive query in
MINE RULE. This is similar in spirit to a traditional relational view in a traditional
database. The view relation is defined by a query and can be queried like any
other relation later on. It is the task of the (inductive) database management
system to ensure (using, e.g., query materialization or query transformation) that
the right answers are generated for such views. Pattern views raise many new
challenges for data mining. The other notion that is nicely elaborated in MINE
RULE concerns the dominance and subsumption relations between consecutive
inductive queries. [11] studies the properties that the sets of patterns generated
by two MINE RULE queries exhibit in interesting situations. For instance, given
two similar queries that are identical apart from one or more clauses in which
they differ in an attribute, the result sets of the two queries exhibit an inclusion
relationship when a functional dependency is present between the differing
attributes. [11] also studies the equivalence properties that two MINE RULE queries
exhibit when they have two clauses with constraints on attributes that are
functionally dependent. Finally, it studies the properties that the queries have when
multiple keys of a relation are involved. All these notions, if elaborated in the
context of inductive databases, will help the system to speed up the query
answering procedures. Again, these ideas have been carried over to the logical
theory of inductive databases [29].

Partners of the consortium have been inspired by the MINE RULE query lan- 
guage to study information discovery from XML data by means of association 
rules [22,21]. 

4.4 Towards a Theory of Inductive Databases 

The final goal of the cInQ project is to propose a theory of inductive databases. 
As a first valuable step, a logical and set-oriented theory of inductive databases 
has been proposed [29,30,31], where the key idea is that a database consists of 
sets of data sets and sets of pattern sets. Furthermore there is an inductive query 
language, where each query either generates a data set or a pattern set. Queries
generating pattern sets are, in their most general form, arbitrary boolean
expressions over monotonic and anti-monotonic primitives. This corresponds to a
logical view of inductive databases because the queries are boolean expressions 
as well as a set oriented one because the answers to inductive queries are sets of 
patterns. 
Issues concerned with the evaluation and optimization of such inductive
queries based on the border set representations can be found in [30]. Further- 
more, various other issues concerned with inductive pattern views and the mem- 
ory organization of such logical inductive databases are explored in [29]. Finally, 
various formal properties of arbitrary boolean inductive queries (e.g., normal 
forms, minimal number of version spaces needed) have been studied [31]. 
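
To make the idea of a boolean inductive query over monotonic and anti-monotonic primitives concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the evaluation strategy of [30]: it enumerates all itemsets by brute force instead of using border sets or version spaces, and the helper names (`min_frequency`, `contains`, `solve`) are hypothetical.

```python
from itertools import combinations


def frequency(pattern, dataset):
    """Number of rows of the data set that contain the pattern."""
    return sum(1 for row in dataset if pattern <= row)


def min_frequency(dataset, threshold):
    # Anti-monotonic primitive: if it holds for a pattern,
    # it holds for every subset of that pattern.
    return lambda p: frequency(p, dataset) >= threshold


def contains(items):
    # Monotonic primitive: if it holds for a pattern,
    # it holds for every superset of that pattern.
    required = frozenset(items)
    return lambda p: required <= p


def solve(query, items):
    """Brute-force answer: the set of all itemsets satisfying the query."""
    universe = sorted(items)
    return {frozenset(c)
            for k in range(len(universe) + 1)
            for c in combinations(universe, k)
            if query(frozenset(c))}


# Toy data set and a conjunctive query (both hypothetical).
D = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
freq2 = min_frequency(D, 2)
has_a = contains("a")
answer = solve(lambda p: freq2(p) and has_a(p), "abc")
print(sorted("".join(sorted(p)) for p in answer))
```

The query is an ordinary boolean expression (the logical view), and its answer is a set of patterns (the set-oriented view); any combination of `and`, `or` and `not` over the primitives fits the same scheme.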

Interestingly, these theoretical results have emerged from an abstraction of 
useful KDD processes, e.g., for molecular fragment discovery with the domain 
specific inductive database MolFea [32,50,49,48] or for association rule mining 
processes with, e.g., the MINE RULE operator. 

5 Conclusions 

We provided a presentation of the itemset pattern domain. Any progress on 
constraint-based mining for itemsets can influence the research on the multiple 
uses of frequent itemsets (feature construction, similarity measures and clustering,
classification rule mining or Bayesian network construction, etc.). It means
that, not only (more or less generalized) association rule mining in difficult con- 
texts like dense data sets can become tractable but also many other data mining 
processes can benefit from this outcome. 

We introduced most of the results obtained by the cInQ consortium after 18 
months of work. A few concepts have emerged that are now studied in depth, 
e.g., approximative and exact condensed representations, relationships between 
inductive query solutions and version spaces, strategies for the active use of
constraints during inductive query evaluation, containment and dominance between
mining queries. 

A lot has yet to be done, e.g., towards the use of these concepts for predictive 
data mining tasks. Also, we have to study the robustness of these concepts in 
various application domains and thus different pattern domains. It is a key issue 
to identify a set of data mining primitives and thus figure out what could be 
a good query language for inductive databases. Indeed, the design of dedicated 
inductive databases, e.g., inductive databases for molecular fragment discovery, 
is an invaluable step. Not only does it solve interesting application problems, but
it also gives the material for abstraction and thus the foundations of the inductive
database framework.

Abstract. Recently, inductive databases (IDBs) have been proposed to
tackle the problem of knowledge discovery from huge databases. With 
an IDB, the user/analyst performs a set of very different operations on 
data using a query language, powerful enough to support all the required 
manipulations, such as data preprocessing, pattern discovery and pattern 
post-processing. We provide a comparison between three query languages 
(MSQL, DMQL and MINE RULE) that have been proposed for descriptive 
rule mining and discuss their common features and differences. These 
query languages look like extensions of SQL. We present them using 
a set of examples, taken from the real practice of rule mining. In the
paper we also discuss OLE DB for Data Mining and the Predictive Model
Markup Language, two recent proposals: the former, like the first three query
languages, provides native support for data mining primitives, while the latter
provides a description, in a standard language, of statistical and data
mining models.



1 Introduction 

Knowledge Discovery in Databases (KDD) is a complex process which involves 
many steps that must be done sequentially. When considering the whole KDD 
process, the proposed approaches and querying tools are still unsatisfactory. The 
relation among the various proposals is also sometimes unclear because, at the 
moment, a general understanding of the fundamental primitives and principles 
that are necessary to support the search of knowledge in databases is still lacking. 

In the cInQ project¹, we want to develop a new generation of databases,
called “inductive databases”, as suggested in [5]. This kind of database
integrates raw data with knowledge extracted from raw data, materialized in the
form of patterns, into a common framework that supports the KDD process. In
this way, the KDD process consists essentially of a querying process, enabled by
a powerful query language that can deal with both raw data and patterns. A few
query languages can be considered as candidates for inductive databases. Most
proposals emphasize one of the different phases of the KDD process. This paper
is a critical evaluation of three proposals in the light of the IDBs’ requirements:
MSQL [6,7], DMQL [10,11] and MINE RULE [12,13]. In the paper we also discuss OLE
DB for Data Mining (OLE DB DM) by Microsoft and the Predictive Model Markup
Language (PMML) by the Data Mining Group [18]. OLE DB DM is an Application
Programming Interface whose aim is to ease the task of developing data mining
applications over databases. It is related to the other query languages because,
like them, it provides native support for data mining primitives. PMML, instead, is
a standard markup language, based on XML, that describes statistical and data
mining models.

¹ Project (IST 2000-26469) partially funded by the EC IST Programme - FET.

R. Meo et al. (Eds.): Database Support for Data Mining Applications, LNAI 2682, pp. 24-51, 2004.

The paper is organized as follows. Section 2 summarizes the desired proper- 
ties of a language for mining inside an inductive database. Section 3 introduces 
the main features of the analyzed languages, whereas in Section 4 some real ex- 
amples of queries are discussed, so that the comparison between the languages 
is straightforward. Finally Section 5 draws some conclusions. 



2 Desired Properties of a Data Mining Query Language 

A query language for IDBs is an extension of a database query language that
includes primitives for supporting the steps of a KDD process, which are:

— The selection of data to be mined. The language must offer the possibility 
to select (e.g., via standard queries but also by means of sampling), to ma- 
nipulate and to query data and views in the database. It must also provide 
support for multi-dimensional data manipulation. 

DMQL, MINE RULE and OLE DB DM allow the selection of data. None
of them has primitives for sampling. All of them allow multi-dimensional
data manipulation (because this is inherent to SQL).

— The specification of the type of patterns to be mined. Clearly, real-life KDD 
processes need different kinds of patterns, such as various types of descriptive
rules, clusters or predictive models. 

DMQL considers different patterns beyond association rules. 

— The specification of the needed background knowledge (e.g., the definition 
of a concept hierarchy). 

Even though both MINE RULE and MSQL can treat hierarchies if the 
relationship 'is-a’ is represented in a companion relation, DMQL 
allows its explicit definition and use during the pattern extraction. 

— The definition of constraints that the extracted patterns must satisfy. This 
implies that the language allows the user to define constraints that specify 
the interesting patterns (e.g., using measures like frequency, generality, cov- 
erage, similarity, etc). 

DMQL, MSQL and MINE RULE allow the specification of various kinds 
of constraints based on rule elements, rule cardinality and aggre- 
gate values. They allow the specification of primitive constraints 
based on support and confidence measures. DMQL allows some other 
measures like novelty. 
— The satisfaction of the closure property (by storing the results in the database).

All of them satisfy this property. 

— The post-processing of results. The language must allow the user to browse the
patterns, apply selection templates, and cross over patterns and data, e.g., by
selecting the data in which some patterns hold, or aggregating results.

MSQL is richer than the other languages in that it offers a few
post-processing primitives (it has a dedicated operator, SelectRules).
DMQL allows some visualization options. However, all the languages 
are quite poor for rule post-processing. 

3 Query Languages for Rule Mining 

3.1 MSQL 

MSQL [7,14] has been designed at Rutgers University, New Jersey, USA.
Rules in MSQL are based on descriptors, each descriptor being an expression of
the form $(A_i = a_{ij})$, where $A_i$ is an attribute and $a_{ij}$ is a value or a range of
values in the domain of $A_i$. A conjunctset is the conjunction of an arbitrary
number of descriptors, provided that there is no pair of descriptors built on the
same attribute. In practice, MSQL extracts propositional rules like $A \Rightarrow B$, where
$A$ is a conjunctset and $B$ is a descriptor (it follows that only a single proposition
is allowed in the consequent). We say that a tuple $t$ of a relation $R$ satisfies a
descriptor $(A_i = a_{ij})$ if the value of $A_i$ in $t$ is equal to $a_{ij}$. Moreover, $t$ satisfies
a conjunctset $C$ if it satisfies all the descriptors of $C$. Finally, $t$ satisfies a rule
$A \Rightarrow B$ if it satisfies all the descriptors in $A$ and $B$, but it violates the rule
$A \Rightarrow B$ if it does not satisfy $A$ or $B$. Notice that the support of a rule is defined
as the number of tuples satisfying $A$ in the relation on which the mining has
been performed, and the confidence is the ratio between the number of tuples
satisfying both $A$ and $B$ and the support of the rule. An example of a rule
extracted from the $Emp(emp\_id, job, sex, car)$ relation containing employee data is
$(job = doctor) \wedge (sex = male) \Rightarrow (car = BMW)$.
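
The definitions of support and confidence given above can be illustrated with a short Python sketch. It is not MSQL code: the dictionary encoding of conjunctsets, the function names and the four-row employee relation are assumptions made purely for illustration, and only equality descriptors (no ranges) are handled.

```python
def satisfies(t, conjunctset):
    """True if tuple t satisfies every descriptor (attribute = value).

    A conjunctset is encoded as a dict, which also guarantees that
    no two descriptors are built on the same attribute.
    """
    return all(t.get(attr) == value for attr, value in conjunctset.items())


def support(relation, body):
    # Support as defined in the text: number of tuples satisfying the body.
    return sum(1 for t in relation if satisfies(t, body))


def confidence(relation, body, consequent):
    # Tuples satisfying both body and consequent, divided by the support.
    both = sum(1 for t in relation
               if satisfies(t, body) and satisfies(t, consequent))
    s = support(relation, body)
    return both / s if s else 0.0


# Hypothetical instance of the Emp relation.
emp = [
    {"emp_id": 1, "job": "doctor", "sex": "male", "car": "BMW"},
    {"emp_id": 2, "job": "doctor", "sex": "male", "car": "Audi"},
    {"emp_id": 3, "job": "doctor", "sex": "female", "car": "BMW"},
    {"emp_id": 4, "job": "nurse", "sex": "male", "car": "BMW"},
]
body = {"job": "doctor", "sex": "male"}
consequent = {"car": "BMW"}

print(support(emp, body), confidence(emp, body, consequent))
```

Note that, under this definition, support counts the tuples matching the body only, which differs from the more common convention of counting tuples that match both sides of the rule.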

The main features of MSQL, as stated by the authors, are: 

— Ability to nest SQL expressions such as sorting and grouping in an MSQL
statement and allowing nested SQL queries by means of the WHERE clause.

— Satisfaction of the closure property and availability of operators to further
manipulate results of previous MSQL queries.

— Cross-over between data and rules with operations that identify subsets
of data satisfying or violating a given set of rules.

— Distinction between rule generation and rule querying. This allows splitting
rule generation, which is computationally expensive, from rule post-processing,
which must be as interactive as possible.

MSQL comprises four basic statements (see Section 4 for examples): 

— Create Encoding that encodes continuous valued attributes into discrete 
values. Notice that during mining, the discretization is done “on the fly” , so 
that it is not necessary to materialize a separate copy of the table. 




Query Languages Supporting Descriptive Rule Mining 



— A GetRules query computes rules from the data and materializes them into 
a rule database. Its syntax is as follows: 

    [Project Body, Consequent, confidence, support]
    GetRules (C) [as R1] [into <rulebase_name>]
    [where (RC | PC | MC | SQ)]
    [sql-group-by clause] [using-encoding-clause]

A GetRules query can deal with different conditions on the rules: 

• Rule format condition (RC), which restricts the items occurring
in the rule elements. RC has the following format:

    Body { in | has | is } <descriptor-list>
    Consequent { in | is } <descriptor-list>

• Pruning condition (PC), which defines thresholds for support and confidence
values, and constraints on the length of the rules. PC has the
format:

    confidence <relop> <float-val in [0.0,1.0]>
    support <relop> <integer>
    support <relop> <float-val in [0.0,1.0]>
    length <relop> <integer>
    relop ::= { < | <= | = | >= | > }

• Mutex condition (MC), which prevents two given attributes from occurring in the
same rule (useful when we know some functional dependencies between
attributes). Its syntax is:

    Where <other-conditions>
    { AND | OR } mutex (method, method [, method])
    [{ AND | OR } mutex (method, method [, method])]

• Subquery conditions (SQ), which are subqueries connected with the conventional
WHERE keyword using IN and (NOT) EXISTS.

— A SelectRules query can be used for rule post-processing, i.e., querying 
previously extracted rules. Its syntax is as follows: 

SelectRules (rulebase_name) [where <conditions>] 

where <conditions> concerns the body, the consequent, the support and/or 
the confidence of the rules. 

— Satisfies and Violates, which allow crossing over data and rules. These two
statements can be used together with a database selection statement, inside
the WHERE clause of a query.
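
The separation between rule generation and rule querying can be pictured with the following Python sketch. It does not use MSQL syntax: the rulebase is assumed to have already been materialized (as a GetRules query would do), and `select_rules`, `satisfies_rule` and the sample data are hypothetical names and values chosen for illustration. Only the "satisfies" side of the data/rule cross-over is shown.

```python
# Hypothetical relation and previously materialized rulebase.
emp = [
    {"emp_id": 1, "job": "doctor", "sex": "male", "car": "BMW"},
    {"emp_id": 2, "job": "doctor", "sex": "male", "car": "Audi"},
    {"emp_id": 3, "job": "doctor", "sex": "female", "car": "BMW"},
    {"emp_id": 4, "job": "nurse", "sex": "male", "car": "BMW"},
]
rulebase = [
    {"body": {"job": "doctor", "sex": "male"}, "consequent": {"car": "BMW"},
     "support": 2, "confidence": 0.5},
    {"body": {"job": "doctor"}, "consequent": {"car": "BMW"},
     "support": 3, "confidence": 2 / 3},
]


def select_rules(rulebase, condition):
    """Post-processing step: filter stored rules without re-mining."""
    return [rule for rule in rulebase if condition(rule)]


def satisfies_rule(t, rule):
    """A tuple satisfies a rule if it matches all body and consequent descriptors."""
    descriptors = {**rule["body"], **rule["consequent"]}
    return all(t.get(attr) == value for attr, value in descriptors.items())


# Query the rulebase, then cross over the selected rule with the data.
selected = select_rules(rulebase, lambda r: r["confidence"] >= 0.6)
matching_ids = [t["emp_id"] for t in emp if satisfies_rule(t, selected[0])]
print(len(selected), matching_ids)
```

Because the filter works on stored rules only, it can stay interactive even when the original mining step was expensive.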



3.2 MINE RULE 

MINE RULE has been designed at the Politecnico di Torino and the Politecnico di 
Milano, Italy [12,13]. This operator extracts a set of association rules from the 
database and stores them back in the database in a separate relation. 

An association rule extracted by MINE RULE from a source relation is defined
as follows. Let us consider a source relation over the schema $S$. Let $\mathcal{R}$ and $\mathcal{G}$
be two disjoint subsets of $S$ called respectively the schema of the rules and
the grouping attributes. An association rule is extracted from (or satisfied by)
at least one group of the source relation, where each group is a partition of the
relation by the values of the grouping attributes $\mathcal{G}$. An association rule has the
form $A \Rightarrow B$, where $A$ and $B$ are sets of rule elements ($A$ is the body of the rule
and $B$ the head). The elements of a rule are taken from the tuples of one group.
In particular, each rule element is a projection over (a subset of) $\mathcal{R}$. Note,
however, that for a given MINE RULE statement, the schema of the body elements
and the schema of the head elements are each unique, even though the two may be different.

An example of a rule extracted from the relation $Emp(emp\_id, job, sex, car)$
grouped by $emp\_id$ with the schema of the body $(job, sex)$ and the schema of
the head $(car)$ is the following: $\{(doctor, male)\} \Rightarrow \{(BMW)\}$. This rule is
extracted from within tuples, because each group coincides with a tuple of the
relation. Instead, for the relation $Sales(transaction\_id, item, customer, payment)$,
collecting data on customers' purchases, grouped by customer and with the rule
schema $(item)$ (where body and head schemas coincide), a rule could be
$\{(pasta), (oil), (tomatoes)\} \Rightarrow \{(wine)\}$.
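
The role of groups in the Sales example can be sketched in Python as follows. This is an illustrative sketch, not the MINE RULE operator: the sample rows and function names are hypothetical, and support and confidence are computed over groups using the usual definitions (fraction of groups containing body and head, and that count divided by the number of groups containing the body), which is an assumption not spelled out at this point of the text.

```python
from collections import defaultdict

# Hypothetical rows of Sales(transaction_id, item, customer, payment).
sales = [
    (1, "pasta", "ann", "cash"), (1, "oil", "ann", "cash"),
    (2, "tomatoes", "ann", "card"), (2, "wine", "ann", "card"),
    (3, "pasta", "bob", "cash"), (3, "oil", "bob", "cash"),
    (4, "tomatoes", "bob", "cash"),
    (5, "pasta", "carl", "card"), (5, "oil", "carl", "card"),
    (5, "tomatoes", "carl", "card"), (5, "wine", "carl", "card"),
    (6, "wine", "dan", "cash"),
]


def group_items(rows):
    """Partition the relation by customer; keep the projection on item."""
    groups = defaultdict(set)
    for _transaction_id, item, customer, _payment in rows:
        groups[customer].add(item)
    return dict(groups)


def rule_stats(groups, body, head):
    """Return (support, confidence) of body => head, counted over groups."""
    with_body = [g for g in groups.values() if body <= g]
    with_both = [g for g in with_body if head <= g]
    support = len(with_both) / len(groups)
    confidence = len(with_both) / len(with_body) if with_body else 0.0
    return support, confidence


groups = group_items(sales)
sup, conf = rule_stats(groups, {"pasta", "oil", "tomatoes"}, {"wine"})
print(sup, conf)
```

Grouping by customer rather than by transaction is what lets the body and head items come from different transactions of the same customer.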

The MINE RULE language is an extension of SQL. Its main features are: 

— Selection of the relevant set of data for a data mining process. This feature
is applied at different granularity levels, that is, at the row level (selection of
a subset of the rows of a relation) or at the group level (group condition).
The grouping condition determines which data of the relation can take part
in an association rule. This feature is similar to the grouping conditions that
we can find in conventional SQL. The definition of groups, i.e., the partitions
from which the rules are extracted, is made at run time and is not decided
a priori with the key of the source relation (as in DMQL).

— Definition of the structure of the rules. This feature defines single-dimensional
association rules (i.e., rule elements are the different values of the same
dimension or attribute), or multi-dimensional rules (rule elements involve the
value of more than one attribute). The structure of the rules can also be
constrained by specifying the cardinality of the rule’s body and head.

— Definition of constraints applied at different granularity levels. Constraints
belong to two categories: constraints applied at the rule level (mining
conditions), and constraints applied at the cluster level (cluster conditions). A
mining condition is a constraint that is evaluated and satisfied by each tuple
whose attributes, as rule elements, are involved in the rule. A cluster
condition is a constraint evaluated for each cluster. Clusters are subgroups
(or partitions) of the main groups that are created by keeping together tuples
of the same group that present common features (i.e., the value of the
clustering attributes). In the presence of clusters, rule body and head are extracted
from a pair of clusters of the same group satisfying the cluster conditions.
For instance, clusters and cluster conditions may be exploited in order to
extract association rules in which body and head are ordered and therefore
constitute the elementary patterns of sequences.

— Definition of rule evaluation measures. In practice, the language allows the
user to define support and confidence thresholds.

The general syntax of MINE RULE follows: 

    <MineRuleOp> ::= MINE RULE <TableName> AS
        SELECT DISTINCT <BodyDescr>, <HeadDescr> [, SUPPORT] [, CONFIDENCE]
        [WHERE <WhereClause>]
        FROM <FromList> [ WHERE <WhereClause> ]
        GROUP BY <AttrList> [ HAVING <HavingClause> ]
        [ CLUSTER BY <AttrList> [ HAVING <HavingClause> ]]
        EXTRACTING RULES WITH SUPPORT: <real>, CONFIDENCE: <real>
    <BodyDescr> ::= [ <CardSpec> ] <AttrList> AS BODY
    <HeadDescr> ::= [ <CardSpec> ] <AttrList> AS HEAD
    <CardSpec> ::= <Number> .. (<Number> | n)
    <AttrList> ::= <AttrName> [, <AttrList>]

3.3 DMQL 

DMQL has been designed at Simon Fraser University, Canada [10,11]. In DMQL,
an association rule is a relation between the values of two sets of predicates that
are evaluated on the relations of the database. These predicates are of the form
$P(X, c)$, where $P$ is a predicate that takes the name of an attribute of the underlying
relation, $X$ is a variable and $c$ is a constant value belonging to the attribute’s
domain. The predicate is satisfied if in the relation there exists a tuple identified
by the variable $X$ whose homonymous attribute takes the value $c$. Notice that it is
possible for the predicates to be evaluated on different relations of the database.
For instance, DMQL can extract rules like $town(X, \text{'London'}) \Rightarrow buys(X, \text{'DVD'})$,
where town and buys may be two attributes of different relations and $X$ is an
attribute present in both relations. Rules may belong to different categories: a
single-dimensional rule contains multiple occurrences of a single predicate (e.g.,
buys), while a multi-dimensional rule involves several predicates, each of which
occurs only once in the rule. However, the presence of one or more instances of a
predicate in the same rule can be specified by the name of the predicate followed
by +. Another important feature of DMQL is that it allows the user to guide the
discovery process by using metapatterns. Metapatterns are a kind of template that
restricts the syntactic form of the association rules to be extracted. Moreover,
they represent a way to push some hypotheses of the user, and it is possible to
incorporate some further constraints in them. An example of a metapattern could
be $town(X: customer, London) \wedge income(X, Y) \Rightarrow buys(X, Z)$, which restricts
the discovery to rules with a body concerning town and income levels of the
customers and a head concerning one item bought by those customers. Furthermore,
a metapattern can allow the presence of non-instantiated predicates that
the mining task will take care to instantiate to the name of a valid attribute
of the underlying relation. For instance, if we want to extract association rules
describing the customers’ traits that are frequently related to the purchase of
certain items by those customers, we could use the following metapattern to
guide the association rule mining:

$P(X: customer, W) \wedge Q(X, Y) \Rightarrow buys(X, Z)$

where $P$ and $Q$ are predicate variables that can be instantiated to the relevant
attributes of the relations under examination, $X$ is a key of the customer
relation, and $W$, $Y$ and $Z$ are object variables that can assume the values of the
respective predicates for customer $X$.
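
The instantiation of predicate variables in the last metapattern can be illustrated with a small Python sketch. It is a schematic enumeration only, not part of DMQL: the attribute list and the function name `instantiate` are hypothetical, and the sketch stops at producing candidate rule templates (the object variables W, Y and Z would then be bound to values by the mining step).

```python
from itertools import combinations


def instantiate(attributes, head="buys"):
    """Enumerate instantiations of P and Q in
    P(X, W) AND Q(X, Y) => head(X, Z) over the given attribute names.

    P and Q are bound to two distinct attributes other than the head;
    since the conjunction is commutative, unordered pairs are enough.
    """
    candidates = [a for a in attributes if a != head]
    return [f"{p}(X, W) AND {q}(X, Y) => {head}(X, Z)"
            for p, q in combinations(candidates, 2)]


# Hypothetical attributes of the relations under examination.
patterns = instantiate(["town", "income", "age", "buys"])
for pattern in patterns:
    print(pattern)
```

Fixing the head predicate while leaving the body predicates free is what lets the user state a hypothesis ("something explains what customers buy") without naming the explanatory attributes in advance.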

DMQL consists of the specification of four major primitives in data mining, 
that are the following: 

— The set of relevant data w.r.t. a data mining process. 

This primitive can be specified like in a conventional relational query ex- 
tracting the set of relevant data from the database. 

— The kind of knowledge to be discovered. 

This primitive may include association rules, classification rules (rules that 
assign data to disjoint classes according to the value of a chosen classifying 
attribute), characteristics (descriptions that constitute a summarization of 
the common properties in a given set of data), comparisons (descriptions 
that allow comparing the total number of tuples belonging to a class with
different contrasting classes), generalized relations (obtained by generalizing 
a set of data corresponding to low level concepts with data corresponding to 
higher level concepts according to a specified concept hierarchy). 

— The background knowledge. 

This primitive manages a set of concept hierarchies or generalization
operators which assist the generalization process.

— The justification of the interestingness of the knowledge (i.e., thresholds). 
This primitive is included as a set of different constraints depending on the 
kind of target rules. For association rules, e.g., besides the classical support 
and confidence thresholds, DMQL allows the specification of noise (the mini- 
mum percentage of tuples in the database that must satisfy a rule so that it 
is not discarded) and rule novelty, for selecting the most specific rules. 

The DMQL grammar for extracting association rules is an extension of the 
conventional SQL grammar. Thus, we can find in it traditional relational oper- 
ators like HAVING, WHERE, ORDER BY and GROUP BY, but we can also specify the 
database, select the relevant attributes of the database relation and the concept 
hierarchy, define thresholds and guide the mining process using a metapattern. 
The general syntax of a DMQL query is: 

    use database <database-name>
    {use hierarchy <hierarchy-name> for <attribute-or-dimension>}
    in relevance to <attribute-or-dimension-list>
    mine associations [as <pattern-name>] [matching <metapattern>]
    from <relation(s)/cube(s)> [where <condition>]
    [order by <order-list>]
    [group by <grouping-list>] [having <condition>]
    with <interest-measure> threshold = value

3.4 OLE DB DM

OLE DB DM has been designed at Microsoft Corporation [16,17]. It is an extension
of the OLE DB Application Programming Interface (API) that allows any
application to easily access a relational data source under the Windows family
of operating systems. The main motivation for the design of OLE DB DM is to ease
the development of data mining projects with applications that are not stand-alone
but are tightly coupled with the DBMS. Indeed, research work in data mining
has focused on scaling analysis and algorithms running outside the DBMS on
data exported from the databases into files. This situation generates problems in
the deployment of the data mining models produced, because the data management
and the maintenance of the model occur outside the DBMS and must be
handled by ad-hoc solutions. OLE DB DM aims to ease the burden of making data
sources communicate with data mining algorithms (also called mining model
providers).

The key idea of OLE DB DM is the definition of a data mining model, i.e. a 
special sort of table whose rows contain an abstraction, a synthetic description 
of input data (called case set). The user can populate this model with predicted 
or summary data obtained running a data mining algorithm, specified as part 
of the model, over the case set. Once the mining task is done, it is possible 
to use the data mining model, for instance to predict some values over new 
cases, or browse the model for post-processing activities, such as reporting or 
visualization. 

The representation of the data in the model depends on the format of data 
produced by the algorithm, which could produce its output data, for instance,
by using PMML (Predictive Model Markup Language [18]). PMML is an XML-based
standard proposed by the DMG. It is a markup language for the description
of statistical and data mining models. PMML describes the inputs of data min- 
ing models, the data transformations used for the preparation of data and the 
parameters used for the generation of the models themselves. 

OLE DB DM provides an SQL-like language that allows client applications to
perform the key operations in the OLE DB DM framework: definition of a data
mining model (with the CREATE MINING MODEL statement), execution of an
external mining algorithm on data provided by a relational source and population
of the data mining model (INSERT INTO statement), prediction of the value of
some attributes on new data (PREDICTION JOIN), and browsing of the model (SELECT
statement).

Thus, elaboration of an OLE DB DM model can be done using classical SQL
queries. Once the mining algorithm has been executed, it is possible to do some
crossing-over between the data mining model and the data fitting the mining
model using the PREDICTION JOIN statement. This is a special form of the SQL
join that predicts the value of some attributes in the input data (test
data) according to the model, provided that these attributes were specified as
prediction attributes in the mining model.

The grammar for the creation of a data mining model is the following:

    <dm_create> ::= CREATE MINING MODEL <identifier> (<col_def_list>)
                    USING <algorithm> [(<algo_param_list>)]
    <col_def_list> ::= <col_def> | <col_def_list>, <col_def>
    <col_def> ::= <col_def_reg> | <col_def_tbl>
    <col_def_reg> ::= <identifier> <col_type> [<col_distribution>]
                      [<col_binary>] [<col_content>] [<col_content_qual>]
                      [<col_qualif>] [<col_prediction>] [<relation_clause>]
    <col_def_tbl> ::= <identifier> TABLE <col_prediction>
                      ( <col_def_list> )
    // algorithms currently implemented in SQL Server 2000
    <algorithm> ::= MICROSOFT_DECISION_TREES | MICROSOFT_CLUSTERING
    <algo_param_list> ::= <algo_param> | <algo_param>, <algo_param_list>
    <algo_param> ::= <identifier> = <value>
    <col_type> ::= LONG | BOOLEAN | TEXT | DOUBLE | DATE
    <col_distribution> ::= NORMAL | UNIFORM
    <col_binary> ::= MODEL_EXISTENCE_ONLY | NOT NULL
    <col_content> ::= DISCRETE | CONTINUOUS
                    | DISCRETIZED ( [<disc_method> [, <numeric_const>] ] )
                    | SEQUENCE_TIME
    <disc_method> ::= AUTOMATIC | EQUAL_AREAS | THRESHOLDS | CLUSTERS
    <col_content_qual> ::= ORDERED | CYCLICAL
    <col_qualif> ::= KEY | PROBABILITY | VARIANCE | STDEV | STDDEV
                   | PROBABILITY_VARIANCE | PROBABILITY_STDEV
                   | PROBABILITY_STDDEV | SUPPORT
    <col_prediction> ::= PREDICT | PREDICT_ONLY
    <relation_clause> ::= <related_to_clause> | <of_clause>
    <related_to_clause> ::= RELATED TO <identifier> | RELATED TO KEY
    <of_clause> ::= OF <identifier> | OF KEY

Notice that the grammar allows many kinds of qualifiers to be specified for an
attribute. For instance, it is possible to specify the role of an attribute in the model
(key), the type of an attribute, whether the attribute domain is ordered or cyclical,
whether it is continuous or discrete (and in the latter case the type of discretization
used), whether the attribute is a measurement of time, its range, etc. It is also possible
to give a probability and other statistical features associated with an attribute value.
The probability specifies the degree of certainty that the value of the attribute
is correct.

The PREDICT keyword specifies that the attribute is a prediction attribute. This
means that the content of the attribute will be predicted on test data by the data
mining algorithm according to the values of the other attributes of the model.

RELATED TO associates the current attribute with other attributes, for
instance for a foreign key relationship or because the attribute is used to classify
the values of another attribute.

Notice that the <col_def_tbl> production rule allows a data mining model
to contain nested tables. Nested tables are tables stored as the single values
of a column in an outer table. The input data of a mining algorithm are often
obtained by gathering and joining information that is scattered in different tables
of the database. For instance, customer information and sales information are
generally kept in different tables. Thus, when joining the customer and the sales
tables, it is possible to store in a nested table of the model all the items that have
been bought by a given customer. In this way, nested tables reduce redundant
information in the model.
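
The effect of a nested table can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the rows, the helper name
nest_items and the dictionary layout are hypothetical and are not part of
OLE DB DM. The flat join repeats the customer attributes once per item, whereas
the nested form keeps them once and stores the items as a single nested value.

```python
# Flat result of joining customers with sales: one row per (customer, item).
# The rows are invented for illustration.
joined_rows = [
    ("c1", 26, "ski_pants"),
    ("c1", 26, "hiking_boots"),
    ("c2", 35, "col_shirts"),
    ("c2", 35, "brown_boots"),
]


def nest_items(rows):
    """Group the flat rows into one record per customer with a nested item list."""
    nested = {}
    for customer_id, age, item in rows:
        record = nested.setdefault(customer_id, {"customer_age": age, "items": []})
        record["items"].append(item)
    return nested


nested = nest_items(joined_rows)
```

Here the age of each customer is stored once, instead of once per purchased item.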

Notice that OLE DB DM seems particularly tailored to predictive tasks, i.e. to
predicting the value of an attribute in a relational table. Indeed, in the current
implementation of OLE DB DM in Microsoft SQL Server 2000, only two algorithms
are provided (Microsoft Decision Trees and Microsoft Clustering) and both of
them are designed for attribute prediction. In contrast, algorithms that use data
mining models for the discovery of association rules, and therefore for tasks
without a direct predictive purpose, do not seem to be currently supported by
OLE DB DM. However, according to the specifications [17], OLE DB DM should soon
be extended for association rule mining.

Notice also that it is possible to directly create a mining model that conforms 
to the PMML standard using the following statement: 

<pmml_create> ::= CREATE MINING MODEL <id> FROM PMML <string>

We recall here the schema used by PMML for the definition of models based 
on association rules. 

<!ENTITY % FIELD-USAGE-TYPE "(active |
                             predicted |
                             supplementary)">

<!ENTITY % OUTLIER-TREAT-METHOD "(asIs |
                                 asMissingValues |
                                 asExtremeValues)">

<!ENTITY % MISS-VALUE-TREAT-METHOD "(asIs | asMean |
                                    asMode | asMedian |
                                    asValue)">

<!ELEMENT MiningField (Extension*)>
<!ATTLIST MiningField
    name                     %FIELD-NAME;               #REQUIRED
    usageType                %FIELD-USAGE-TYPE;         "active"
    outliers                 %OUTLIER-TREAT-METHOD;     "asIs"
    lowValue                 %NUMBER;                   #IMPLIED
    highValue                %NUMBER;                   #IMPLIED
    missingValueReplacement  CDATA                      #IMPLIED
    missingValueTreatment    %MISS-VALUE-TREAT-METHOD;  #IMPLIED
>

<!ELEMENT MiningSchema (MiningField+)>

Notice that according to this specification it is possible to specify the schema
of a model by giving the name, type and range of values of each attribute.
Furthermore, it is possible to specify the treatment method to apply if the value
of the attribute is missing, or if it is an outlier w.r.t. the predicted value for
that attribute.
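
As a minimal sketch of how a consumer might read such a schema, the following
Python fragment parses a small MiningSchema instance and fills in the two
defaults declared in the ATTLIST (usageType "active" and outliers "asIs"). The
XML instance, its field names and the helper read_mining_schema are
hypothetical and serve only as an illustration; they are not part of PMML.

```python
import xml.etree.ElementTree as ET

# Hypothetical MiningSchema instance; the field names are invented.
SCHEMA_XML = """
<MiningSchema>
  <MiningField name="item"/>
  <MiningField name="customer_age" usageType="predicted"
               missingValueTreatment="asMean"/>
</MiningSchema>
"""

# Default attribute values taken from the ATTLIST declaration.
DEFAULTS = {"usageType": "active", "outliers": "asIs"}


def read_mining_schema(xml_text):
    """Return one dict per MiningField, with the declared defaults filled in."""
    fields = []
    for elem in ET.fromstring(xml_text).findall("MiningField"):
        field = dict(DEFAULTS)
        field.update(elem.attrib)  # explicit attributes override the defaults
        if "name" not in field:
            raise ValueError("MiningField requires a name attribute")
        fields.append(field)
    return fields


schema = read_mining_schema(SCHEMA_XML)
```

A validating XML parser would apply the DTD defaults itself; ElementTree does not
validate, which is why the defaults are applied by hand in this sketch.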



3.5 Feature Summary 

Table 1 summarizes the different features of an ideal query language for rule
mining and shows how the studied proposals satisfy them, as discussed in the
previous sections. Notice that whether OLE DB DM supports some of the features
reported in Table 1 depends strictly on the data mining algorithm referenced in
the data mining model. OLE DB DM does, however, naturally guarantee the
selection of source data, since this feature is its main purpose.

Table 1. Summary of the main features of the different languages. ^1 Depending
on the algorithm. ^2 Only association rules. ^3 Association rules and elementary
sequential patterns. ^4 Concept hierarchies. ^5 Selectrules, satisfies and violates.
^6 Operators for visualization. ^7 PREDICTION JOIN. ^8 Algorithm parameters

Feature                               MSQL    MINE RULE   DMQL      OLE DB DM

Satisfaction of the                   Yes     Yes         Yes       Yes^1
closure property

Selection of source data              No      Yes         Yes       Yes

Specification of different            No^2    Some^3      Yes       Not directly^1
types of patterns

Specification of the                  No      No          Some^4    No
Background Knowledge

Post-processing of                    Yes^5   No          Some^6    Some^7
the generated results

Specification of constraints          Yes     Yes         Yes       No^8

When considering different languages, it is important to identify precisely the
kind of descriptive rules that are extracted. All the languages can extract
intra-tuple association rules, i.e. rules that associate values of attributes of a
tuple. The obtained association rules describe the common properties of (a
sufficient number of) tuples of the relation. In contrast, only DMQL and MINE RULE
can extract inter-tuple association rules, i.e. rules that associate the values of
attributes of different tuples and therefore describe the properties of a set of
tuples. Nested tables in the data mining model of OLE DB DM could ease the
extraction of inter-tuple association rules by the data mining algorithm. Indeed,
nested tables gather in a single row of the model the features of different tuples
of the original source tables. Thus, intra-tuple association rules seem to
constitute the common “core” of the expressive capabilities of the three languages.

The capability of a language to deal with inter-tuple rules affects the
representation of the input for the mining engine. As already said, MSQL considers
only intra-tuple association rules. As illustrated in the next section, this limit
may be overcome by a change of representation of the input relation, i.e., by
including the relevant attributes of different tuples in a single tuple of a new
relation. However, this can be tedious and time-consuming pre-processing work.
Furthermore, in these cases, the MSQL statements that capture the same semantics
as the analogous statements in DMQL and MINE RULE can be very complex and
difficult to understand.

As a last example of the different capabilities of the languages, we can mention 
that while DMQL and MINE RULE effectively use aggregate functions (resp. on rule 
elements and on clusters) for the extraction of association rules, MSQL provides 
them only as a post-processing tool over the results. 

4 Comparative Examples 

We describe here a complete KDD process centered around the classical basket
analysis problem that will serve as a running example throughout the paper.

We consider the relations Sales, Transactions and Customers shown in Figure 1.
Relation Sales stores information on the items sold in the purchase transactions;
relation Transactions identifies the customers that have purchased in the
transactions and records the method of payment; relation Customers collects
information on the customers.

From the information in these tables we want to look for association rules
between bought items and customer’s age for payments with credit cards. The
discovered association rules are meant to predict the age of customers according
to their purchase habits. This data mining step first requires some manipulations
as a pre-processing step (selection of the items bought by credit card and
encoding of the age attribute) in order to prepare data for the subsequent pattern
extraction; then the actual pattern extraction step may take place.

Sales

transaction_id   item
1                ski_pants
1                hiking_boots
2                col_shirts
2                brown_boots
3                col_shirts
3                brown_boots
4                jackets
5                col_shirts
5                jackets
6                hiking_boots
6                brown_boots
7                ski_pants
7                hiking_boots
7                brown_boots
8                ski_pants
8                hiking_boots
8                brown_boots
8                jackets
9                hiking_boots
10               ski_pants
11               ski_pants
11               brown_boots
11               jackets

Transactions

transaction_id   customer   payment
1                c1         credit_card
2                c2         credit_card
3                c3         cash
4                c4         credit_card
5                c5         credit_card
6                c6         cash
7                c7         credit_card
8                c8         credit_card
9                c9         credit_card
10               c3         credit_card
11               c2         cash

Customers

customer_id   customer_age   job
c1            26             employee
c2            35             manager
c3            48             manager
c4            39             engineer
c5            46             teacher
c6            25             student
c7            29             employee
c8            24             student
c9            28             employee

Fig. 1. Sales table (top), Transactions table (middle) and Customers table
(bottom)

Suppose that by inspecting the result of a previous data mining extraction 
step, we are now interested in investigating the purchases that violate certain 
extracted patterns. In particular, we are interested in obtaining association rules 
between sets of bought items in the purchase transactions that violate the rules 
with ‘ski_pants’ in their antecedent. To this aim, we can cross-over between 
extracted rules and original data, selecting tuples of the source table that violate 
the interesting rules, and perform a second mining step, based on the results 
of the previous mining step: from the selected set of tuples, we extract the 
association rules between two sets of items with a high confidence threshold. 
Finally, we allow two post-processing operations over the extracted association 
rules: selection of rules with 2 items in the body and selection of rules with a 
maximal body among the rules with the same consequent. 

4.1 MSQL 

The first thing to do is to represent source data in a suitable format for MSQL.
Indeed, MSQL expects to receive a single relation obtained by joining the source
relations Sales, Transactions and Customers on attributes transaction_id and
customer_id. Furthermore, the obtained relation must be encoded in a binary
format such that each tuple represents a transaction with as many boolean
attributes as there are possible items that a customer can purchase. We obtain
the relation in Table 2.

Table 2. Boolean_Sales transactional table used with MSQL

t_id  ski_pants  hiking_boots  col_shirts  brown_boots  jackets  customer_age  payment
t1    1          1             0           0            0        26            credit_card
t2    0          0             1           1            0        35            credit_card
t3    0          0             1           1            0        48            cash
t4    0          0             0           0            1        39            credit_card
t5    0          0             1           0            1        46            credit_card
t6    0          1             0           1            0        25            cash
t7    1          1             0           1            0        29            credit_card
t8    1          1             0           1            1        24            credit_card
t9    0          1             0           0            0        28            credit_card
t10   1          0             0           0            0        41            credit_card
t11   1          0             0           1            1        36            cash

This data transformation highlights the main weakness of MSQL. MSQL
is designed to discover the propositional rules satisfied by the values of the
attributes inside a tuple of a table. If the number of possible items on which a
propositional rule must be generated is very large (as, for instance, the number
of different products in a store), the obtained input table is very large and
neither easily maintainable nor user-readable, because it contains for each
transaction all the possible items even if they have not been bought. The boolean
table must be taken into consideration because MSQL cannot work without it (so
the language is not very flexible in its input); furthermore, building it requires
a data transformation which is expensive (especially when the tables are huge)
and must be performed each time a new problem or source table is submitted.
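
The transformation can be sketched in Python as follows. This is an illustrative
sketch only: the function name to_boolean_table is hypothetical, the input is
restricted to the first three transactions of the Sales relation of Figure 1,
and the columns are simply sorted alphabetically.

```python
# Sales relation of Fig. 1, restricted to the first three transactions.
sales = [
    (1, "ski_pants"), (1, "hiking_boots"),
    (2, "col_shirts"), (2, "brown_boots"),
    (3, "col_shirts"), (3, "brown_boots"),
]


def to_boolean_table(rows):
    """Pivot (transaction_id, item) rows into one 0/1 row per transaction."""
    items = sorted({item for _, item in rows})  # one column per distinct item
    baskets = {}
    for t_id, item in rows:
        baskets.setdefault(t_id, set()).add(item)
    table = {
        t_id: [1 if item in basket else 0 for item in items]
        for t_id, basket in baskets.items()
    }
    return items, table


columns, boolean_sales = to_boolean_table(sales)
```

Every row carries one cell per distinct item, so the width of the table grows with
the size of the catalogue even when each transaction contains only a few items.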

Pre-processing Step 1: Selection of the Subset of Data to be Mined.
We are interested only in clients paying with a credit card. MSQL requires that
we select the subset of data to be mined before the extraction task. We assume
that the relation on which we will work has been correctly selected from the
pre-existing set of data in Table 2 by means of a view named View_on_Sales.

Pre-processing Step 2: Encoding Age. MSQL provides methods to declare
encodings on some attributes. It is important to note that MSQL is able to do
discretization “on the fly”, so that the intermediate encoded value will not appear
in the final results. The following query will encode the age attribute:

CREATE ENCODING e_age ON View_on_Sales.customer_age AS
BEGIN
(MIN, 9, 0), (10, 19, 1), (20, 29, 2), (30, 39, 3),                    (1)
(40, 49, 4), (50, 59, 5), (60, 69, 6), (70, MAX, 7), 0
END;

The relation obtained after the two pre-processing steps is shown in Table 3. 

Table 3. View_on_Sales transactional table after the pre-processing phase

t_id  ski_pants  hiking_boots  col_shirts  brown_boots  jackets  e_age  payment
t1    1          1             0           0            0        2      credit_card
t2    0          0             1           1            0        3      credit_card
t4    0          0             0           0            1        3      credit_card
t5    0          0             1           0            1        4      credit_card
t7    1          1             0           1            0        2      credit_card
t8    1          1             0           1            1        2      credit_card
t9    0          1             0           0            0        2      credit_card
t10   1          0             0           0            0        4      credit_card
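
The intervals declared in the e_age encoding amount to decade buckets that are
clamped at both ends. A minimal Python sketch of this mapping, applied to the
customer ages of the credit card transactions of Table 2, is given here; the
function name encode_age is hypothetical and the code does not reproduce how
MSQL itself evaluates an encoding.

```python
def encode_age(age):
    """Map an age to its e_age code: (MIN, 9) -> 0, (10, 19) -> 1, ..., (70, MAX) -> 7."""
    if age <= 9:
        return 0
    if age >= 70:
        return 7
    return age // 10


# customer_age of the credit card transactions in Table 2.
ages = {"t1": 26, "t2": 35, "t4": 39, "t5": 46,
        "t7": 29, "t8": 24, "t9": 28, "t10": 41}
encoded = {t_id: encode_age(age) for t_id, age in ages.items()}
```

The resulting codes agree with the e_age column of Table 3.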



Rules Extraction over a Set of Items and Customers’ Age. We want
to extract rules associating a set of items to the customer’s age and having a
support over 2 and a confidence over (or equal to) 50%.

GETRULES(View_on_Sales) INTO SalesRB
WHERE BODY has {(ski_pants=1) OR (hiking_boots=1) OR                   (2)
(col_shirts=1) OR (brown_boots=1) OR (jackets=1)} AND
Consequent is {(Age = *)} AND support>2 AND confidence>=0.5
USING e_age FOR customer_age

This example highlights a limit of MSQL: if the number of items is high,
the number of predicates in the WHERE clause increases correspondingly. The
resulting rules are shown in Table 4.

Table 4. Table SalesRB produced by MSQL in the first rule extraction phase

Body                                 Consequent                Support  Confidence
(ski_pants=1)                        (customer_age=[20,29])    3        75%
(hiking_boots=1)                     (customer_age=[20,29])    4        100%
(brown_boots=1)                      (customer_age=[20,29])    3        66%
(ski_pants=1) ∧ (hiking_boots=1)     (customer_age=[20,29])    3        100%
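
The support and confidence values can be recomputed from Table 3 with a short
Python sketch. It assumes that the support is the number of tuples satisfying
both body and consequent and that the confidence is this number divided by the
number of tuples satisfying the body; the function name and the data layout are
hypothetical.

```python
# Rows of View_on_Sales (Table 3): the set of bought items and the e_age code.
view_on_sales = {
    "t1": ({"ski_pants", "hiking_boots"}, 2),
    "t2": ({"col_shirts", "brown_boots"}, 3),
    "t4": ({"jackets"}, 3),
    "t5": ({"col_shirts", "jackets"}, 4),
    "t7": ({"ski_pants", "hiking_boots", "brown_boots"}, 2),
    "t8": ({"ski_pants", "hiking_boots", "brown_boots", "jackets"}, 2),
    "t9": ({"hiking_boots"}, 2),
    "t10": ({"ski_pants"}, 4),
}


def support_and_confidence(rows, body, age_code):
    """Return (support count, confidence) of the rule body => e_age = age_code."""
    body_count = sum(1 for items, _ in rows.values() if body <= items)
    both_count = sum(
        1 for items, age in rows.values() if body <= items and age == age_code
    )
    confidence = both_count / body_count if body_count else 0.0
    return both_count, confidence


ski = support_and_confidence(view_on_sales, {"ski_pants"}, 2)
hiking = support_and_confidence(view_on_sales, {"hiking_boots"}, 2)
both = support_and_confidence(view_on_sales, {"ski_pants", "hiking_boots"}, 2)
```

Under these assumptions the three rules evaluated here yield (3, 0.75), (4, 1.0)
and (3, 1.0), in agreement with the corresponding rows of Table 4.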



Crossing-over: Looking for Exceptions in the Original Data. We select
tuples from View_on_Sales that violate all the extracted rules with ski_pants
in the antecedent (the first and last rule in Table 4).

INSERT INTO Sales2 AS
SELECT * FROM View_on_Sales
WHERE VIOLATES ALL (                                                   (3)
SELECTRULES(SalesRB) WHERE BODY HAS {(ski_pants=1)})

We obtain results given in Table 5. 
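
The selection can be mimicked in Python under one explicit assumption, suggested
by the content of Table 5: a tuple violates a rule whenever it does not satisfy
both the body and the consequent of that rule. The helper names and the data
layout are hypothetical; this is a sketch of the semantics, not of the MSQL
implementation.

```python
# Rows of View_on_Sales (Table 3): the set of bought items and the e_age code.
view_on_sales = {
    "t1": ({"ski_pants", "hiking_boots"}, 2),
    "t2": ({"col_shirts", "brown_boots"}, 3),
    "t4": ({"jackets"}, 3),
    "t5": ({"col_shirts", "jackets"}, 4),
    "t7": ({"ski_pants", "hiking_boots", "brown_boots"}, 2),
    "t8": ({"ski_pants", "hiking_boots", "brown_boots", "jackets"}, 2),
    "t9": ({"hiking_boots"}, 2),
    "t10": ({"ski_pants"}, 4),
}

# Rules of Table 4 as (body items, e_age code of the consequent).
rules = [
    ({"ski_pants"}, 2),
    ({"hiking_boots"}, 2),
    ({"brown_boots"}, 2),
    ({"ski_pants", "hiking_boots"}, 2),
]


def satisfies(items, age, rule):
    """A tuple satisfies a rule if it contains the body and has the head age."""
    body, head_age = rule
    return body <= items and age == head_age


def violates_all(rows, selected_rules):
    """Keep the tuples that satisfy none of the selected rules."""
    return [
        t_id
        for t_id, (items, age) in rows.items()
        if all(not satisfies(items, age, rule) for rule in selected_rules)
    ]


ski_rules = [rule for rule in rules if "ski_pants" in rule[0]]
sales2 = violates_all(view_on_sales, ski_rules)
```

With this reading of VIOLATES ALL, the selected tuples are t2, t4, t5, t9 and t10.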

Rules Extraction over Two Sets of Items. MSQL does not support a
conjunction of an arbitrary number of descriptors in the consequent. Therefore, in
this step we can extract only association rules between one set of items in the
antecedent and a single item in the consequent. The resulting rule set contains
only (brown_boots = 1) ⇒ (col_shirts = 1) with support=1 and confidence=100%.

Table 5. Tuples (in Sales2) violating all rules (in SalesRB) with ski_pants in the
antecedent

t_id  ski_pants  hiking_boots  col_shirts  brown_boots  jackets  e_age
t2    0          0             1           1            0        3
t4    0          0             0           0            1        3
t5    0          0             1           0            1        4
t9    0          1             0           0            0        2
t10   1          0             0           0            0        4



GETRULES(Sales2) INTO SalesRB2
WHERE (Body has {(hiking_boots=1) OR (col_shirts=1)
                 OR (brown_boots=1)}
       AND Consequent is {(jackets=1)}
    OR Body has {(col_shirts=1) OR (brown_boots=1) OR (jackets=1)}
       AND Consequent is {(hiking_boots=1)}
    OR Body has {(brown_boots=1) OR (jackets=1)                        (4)
                 OR (hiking_boots=1)}
       AND Consequent is {(col_shirts=1)}
    OR Body has {(jackets=1) OR (hiking_boots=1)
                 OR (col_shirts=1)}
       AND Consequent is {(brown_boots=1)})
AND support>=0.0 AND confidence>=0.9
USING e_age FOR customer_age

Notice that in this statement the WHERE clause allows several different con- 
ditions on the Body and on the Consequent, because we wanted to allow in the 
Body a proposition on every possible attribute except one that is allowed to ap- 
pear in the Consequent. Writing this statement was possible because the total 
number of items is small in this toy example but would be impossible for a real 
example in which the number of propositions in the WHERE clause explodes. 

Post-processing Step 1: Manipulation of Rules. Select the rules with 2
items in the body.

As MSQL extracts rules with one item in the consequent and it provides the
primitive length applied to the itemsets originating rules, we specify that the
total length of the rules is 3.

SelectRules(SalesRB) where length=3                                    (5)

The only rule satisfying this condition is:

(ski_pants = 1) ∧ (hiking_boots = 1) ⇒ (customer_age = [20, 29])

Post-processing Step 2: Extraction of Rules with a Maximal Body. It
is equivalent to require that there is no pair of rules with the same consequent,
such that the body of the first rule is included in the body of the second one.

SELECTRULES(SalesRB) AS R1
WHERE NOT EXISTS (SELECTRULES(SalesRB) AS R2
                  WHERE R2.body has R1.body                            (6)
                  AND NOT (R2.body is R1.body)
                  AND R2.consequent is R1.consequent)

There are two rules satisfying this condition:

(ski_pants = 1) ∧ (hiking_boots = 1) ⇒ (customer_age = [20, 29])

(brown_boots = 1) ⇒ (customer_age = [30, 39])

Pros and Cons of MSQL. Clearly, the main advantage of MSQL is that it is
possible to query rules as well as data, by using SelectRules on rulebases and
GetRules on data. Another good point is that MSQL has been designed to be
an extension of classical SQL, making the language quite easy to understand.
For example, it is quite simple to test rules against a dataset and to perform
crossing-over between the original data and query results, by using SATISFIES
and VIOLATES. To be considered as a good candidate language for inductive
databases, it is clear that MSQL, which is essentially built around the extraction
phase, should be extended, particularly with a better handling of pre- and
post-processing steps. For instance, even if it provides some pre-processing
operators like ENCODE for discretization of quantitative attributes, it does not
provide any support for complex pre-processing operations, like sampling.
Moreover, tuples on which the extraction task must be performed are supposed to
have been selected in advance. Concerning the extraction phase, the user can
specify some constraints on rules to be extracted (e.g., inclusion of an item in
the body or in the head, rule’s length, mutually exclusive items, etc.) and the
support and confidence thresholds. It would be useful however to have the
possibility to specify more complex constraints and interest measures, for
instance user defined ones.

4.2 MINE RULE 

MINE RULE does not require a specific format for the input table. Therefore
we can suppose to receive data either in the set of normalized relations Sales,
Transactions and Customers of Figure 1 or in a view obtained by joining them.
This view is named SalesView and is shown in Table 6, and we assume it is the
input of the mining task. Using a view is not necessary, but it makes SQL
querying easier by gathering all the necessary information in one table
even though all these data are initially scattered in different tables. Thus, the
user can focus the query writing on the constraints useful for the mining task.

Pre-processing Step 1: Selection of the Subset of Data to be Mined.
In contrast to MSQL, MINE RULE does not require a pre-defined view to be applied
on the original data. As it is designed as an extension to SQL, it perfectly nests
SQL, and thus it is possible to select the relevant subset of data to be mined
by specifying it in the FROM ... WHERE ... clauses of the query.

Table 6. SalesView view obtained joining the input relations

transaction_id   item           customer_age   payment
1                ski_pants      26             credit_card
1                hiking_boots   26             credit_card
2                col_shirts     35             credit_card
2                brown_boots    35             credit_card
3                col_shirts     48             cash
3                brown_boots    48             cash
4                jackets        39             credit_card
5                col_shirts     46             credit_card
5                jackets        46             credit_card
6                hiking_boots   25             cash
6                brown_boots    25             cash
7                ski_pants      29             credit_card
7                hiking_boots   29             credit_card
7                brown_boots    29             credit_card
8                ski_pants      24             credit_card
8                hiking_boots   24             credit_card
8                brown_boots    24             credit_card
8                jackets        24             credit_card
9                hiking_boots   28             credit_card
10               ski_pants      48             credit_card
11               ski_pants      35             cash
11               brown_boots    35             cash
11               jackets        35             cash



Pre-processing Step 2: Encoding Age. Since MINE RULE does not have an
encoding operator for performing pre-processing tasks, the user must discretize
the age values into intervals beforehand.

Rules Extraction over a Set of Items and Customers’ Age. In MINE 
RULE, we specify that we are looking for rules associating one or more items 
(rule’s body) and customer’s age (rule’s head): 

MINE RULE SalesRB AS
SELECT DISTINCT 1..n item AS BODY, 1..1 customer_age AS HEAD,
       SUPPORT, CONFIDENCE
FROM SalesView WHERE payment='credit_card'                             (7)
GROUP BY t_id
EXTRACTING RULES WITH SUPPORT: 0.25, CONFIDENCE: 0.5
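
The role of the WHERE and GROUP BY clauses can be illustrated with a Python
sketch that evaluates a single candidate rule on the rows of SalesView. It
assumes that ages are discretized into decades and that the support is
relative to the number of selected groups; the function names are hypothetical,
and the code evaluates a given rule rather than enumerating all rules as the
mining engine does.

```python
from collections import defaultdict

# SalesView (Table 6): (transaction_id, item, customer_age, payment).
sales_view = [
    (1, "ski_pants", 26, "credit_card"), (1, "hiking_boots", 26, "credit_card"),
    (2, "col_shirts", 35, "credit_card"), (2, "brown_boots", 35, "credit_card"),
    (3, "col_shirts", 48, "cash"), (3, "brown_boots", 48, "cash"),
    (4, "jackets", 39, "credit_card"),
    (5, "col_shirts", 46, "credit_card"), (5, "jackets", 46, "credit_card"),
    (6, "hiking_boots", 25, "cash"), (6, "brown_boots", 25, "cash"),
    (7, "ski_pants", 29, "credit_card"), (7, "hiking_boots", 29, "credit_card"),
    (7, "brown_boots", 29, "credit_card"),
    (8, "ski_pants", 24, "credit_card"), (8, "hiking_boots", 24, "credit_card"),
    (8, "brown_boots", 24, "credit_card"), (8, "jackets", 24, "credit_card"),
    (9, "hiking_boots", 28, "credit_card"),
    (10, "ski_pants", 48, "credit_card"),
    (11, "ski_pants", 35, "cash"), (11, "brown_boots", 35, "cash"),
    (11, "jackets", 35, "cash"),
]


def group_credit_card(rows):
    """Apply the WHERE filter, then build one group per transaction."""
    groups = defaultdict(lambda: {"items": set(), "decade": None})
    for t_id, item, age, payment in rows:
        if payment == "credit_card":
            groups[t_id]["items"].add(item)
            groups[t_id]["decade"] = age // 10  # assumed discretization
    return groups


def rule_measures(groups, body, decade):
    """Relative support and confidence of the rule body => age decade."""
    with_body = [g for g in groups.values() if body <= g["items"]]
    with_both = [g for g in with_body if g["decade"] == decade]
    support = len(with_both) / len(groups)
    confidence = len(with_both) / len(with_body) if with_body else 0.0
    return support, confidence


groups = group_credit_card(sales_view)
support, confidence = rule_measures(groups, {"ski_pants"}, 2)
```

For the body {ski_pants} and the age interval [20,29], this gives a support of
3/8 = 37.5% and a confidence of 3/4 = 75% over the eight credit card transactions.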

If we want to store results in a database supporting the relational model,
extracted rules are stored into the table SalesRB(r_id, b_id, h_id, sup, conf) where
r_id, b_id, h_id are respectively the identifiers assigned to rules, body itemsets
and head itemsets. The body and head itemsets are stored respectively in tables
SalesRB_B(b_id, item) and SalesRB_H(h_id, customer_age). Tables SalesRB,
SalesRB_B and SalesRB_H are shown in Figure 2.

SalesRB_B

Body_id   item
1         ski_pants
2         hiking_boots
3         brown_boots
4         ski_pants
4         hiking_boots

SalesRB

Rule_id   Body_id   Head_id   Support   Confidence
1         1         5         37.5%     75%
2         2         5         50%       100%
3         3         5         37.5%     66%
4         4         5         37.5%     100%

SalesRB_H

Head_id   customer_age
5         [20,29]

Fig. 2. Normalized tables containing rules produced by MINE RULE in the first
rule extraction phase



Crossing-over: Looking for Exceptions in the Original Data. We want 
to find transactions of the original relation whose tuples violate all rules with 
ski_pants in the body. As rule components (bodies and heads) are stored in re- 
lational tables, we use an SQL query to manipulate itemsets. The corresponding 
query is the following: 

SELECT * FROM SalesView AS S1 WHERE NOT EXISTS
 (SELECT * FROM SalesRB AS R1,
         SalesRB_B AS R1_B, SalesRB_H AS R1_H
  WHERE R1.b_id=R1_B.b_id AND R1.h_id=R1_H.h_id AND
  S1.customer_age=R1_H.customer_age AND S1.item=R1_B.item              (8)
  AND EXISTS (SELECT * FROM SalesRB_B AS R2_B
              WHERE R2_B.b_id=R1_B.b_id AND R2_B.item='ski_pants')
  AND NOT EXISTS
      (SELECT * FROM SalesRB_B AS R3_B
       WHERE R1_B.b_id=R3_B.b_id AND NOT EXISTS
             (SELECT * FROM SalesView AS S2
              WHERE S2.t_id=S1.t_id AND S2.item=R3_B.item)))

This query is hard to write and to understand. It aims at selecting tuples of
the original SalesView relation, renamed S1, such that no rule with ski_pants
in the antecedent holds on them. These properties are verified by the first two
nested SELECT clauses. Furthermore, we want to be sure that the above rules are
satisfied by tuples belonging to the same transaction as the original tuple in
S1, in other words, that there are no elements of the body of the rule that are
not satisfied by tuples of the same original transaction. Therefore, we verify
that each body element in the rule is satisfied by a tuple of the SalesView
relation (renamed S2) in the same transaction as the tuple in S1 that we are
considering for the output.



Rules Extraction over Two Sets of Items. This is the classical example of 
extraction of association rules, formed by two sets of items. Using MINE RULE it 
is specified as follows: 

MINE RULE SalesRB2 AS
SELECT DISTINCT 1..n item AS BODY, 1..n item AS HEAD,
       SUPPORT, CONFIDENCE                                             (9)
FROM Sales2
GROUP BY t_id
EXTRACTING RULES WITH SUPPORT: 0.0, CONFIDENCE: 0.9

In this simple toy database the result coincides with the one generated by
MSQL.

Post-processing Step 1: Manipulation of Rules. Once again, as itemsets
corresponding to rule’s components are stored in tables (SalesRB_B,
SalesRB_H), we can select rules having two items in the body with a simple
SQL query.

SELECT * FROM SalesRB AS R1 WHERE 2=                                   (10)
 (SELECT COUNT(*) FROM SalesRB_B R2 WHERE R1.b_id=R2.b_id)
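
As a quick check, the query can be run on the tables of Figure 2 with Python's
sqlite3 module. This is only a sketch: the column types, the in-memory database
and the projection on r_id are choices made here for illustration, and SQLite
stands in for whatever relational engine hosts the rule tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SalesRB   (r_id INTEGER, b_id INTEGER, h_id INTEGER,
                        sup REAL, conf REAL);
CREATE TABLE SalesRB_B (b_id INTEGER, item TEXT);
INSERT INTO SalesRB VALUES (1, 1, 5, 0.375, 0.75), (2, 2, 5, 0.5, 1.0),
                           (3, 3, 5, 0.375, 0.66), (4, 4, 5, 0.375, 1.0);
INSERT INTO SalesRB_B VALUES (1, 'ski_pants'), (2, 'hiking_boots'),
                             (3, 'brown_boots'),
                             (4, 'ski_pants'), (4, 'hiking_boots');
""")

# Rules whose body contains exactly two items.
two_item_rules = conn.execute("""
SELECT R1.r_id FROM SalesRB AS R1 WHERE 2 =
  (SELECT COUNT(*) FROM SalesRB_B AS R2 WHERE R1.b_id = R2.b_id)
ORDER BY R1.r_id
""").fetchall()
```

On this data only rule 4, whose body is {ski_pants, hiking_boots}, is returned.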

Post-processing Step 2: Selection of Rules with a Maximal Body. We
select rules with a maximal body for a given consequent. As rules’ components
are stored in relational tables, we again use an SQL query to perform this task.

SELECT * FROM SalesRB AS R1          -- We select the rules in R1
WHERE NOT EXISTS                     -- such that there is no
 (SELECT * FROM SalesRB AS R2        -- other rule (in R2) with
  WHERE R2.h_id=R1.h_id              -- the same head and a different
  AND NOT R2.b_id=R1.b_id            -- body that contains every
  AND NOT EXISTS (SELECT *           -- item occurring in
   FROM SalesRB_B AS B1              -- the body of the R1 rule
   WHERE R1.b_id=B1.b_id AND NOT EXISTS (SELECT *
    FROM SalesRB_B AS B2                                               (11)
    WHERE B2.b_id=R2.b_id AND B2.item=B1.item)))

This rather complex query aims at selecting rules such that there are no
rules with the same consequent and a body that strictly includes the body of
the former rule. The two inner sub-queries are used to check that the rule body
in R1 is a subset of the rule body in R2. These post-processing queries could
probably be simpler if the SQL-3 standard were adopted for the output of the rules.
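
The behaviour of the query can be checked on the rule tables of Figure 2 with
Python's sqlite3 module. The sketch makes a few choices of its own: an in-memory
SQLite database hosts the tables, only r_id is projected, and the inequality on
body identifiers is written with <> for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SalesRB   (r_id INTEGER, b_id INTEGER, h_id INTEGER,
                        sup REAL, conf REAL);
CREATE TABLE SalesRB_B (b_id INTEGER, item TEXT);
INSERT INTO SalesRB VALUES (1, 1, 5, 0.375, 0.75), (2, 2, 5, 0.5, 1.0),
                           (3, 3, 5, 0.375, 0.66), (4, 4, 5, 0.375, 1.0);
INSERT INTO SalesRB_B VALUES (1, 'ski_pants'), (2, 'hiking_boots'),
                             (3, 'brown_boots'),
                             (4, 'ski_pants'), (4, 'hiking_boots');
""")

# Keep R1 unless another rule R2 with the same head has a body that
# contains every item of the body of R1.
maximal_rules = conn.execute("""
SELECT R1.r_id FROM SalesRB AS R1
WHERE NOT EXISTS
 (SELECT * FROM SalesRB AS R2
  WHERE R2.h_id = R1.h_id
    AND R2.b_id <> R1.b_id
    AND NOT EXISTS (SELECT * FROM SalesRB_B AS B1
                    WHERE R1.b_id = B1.b_id AND NOT EXISTS
                      (SELECT * FROM SalesRB_B AS B2
                       WHERE B2.b_id = R2.b_id AND B2.item = B1.item)))
ORDER BY R1.r_id
""").fetchall()
```

Rules 1 and 2 are discarded because their bodies are contained in the body of
rule 4; rules 3 and 4 remain.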

Pros and Cons of MINE RULE. The first advantage of MINE RULE is that it has
been designed as an extension to SQL. Moreover, as it perfectly nests SQL, it
is possible to use classical statements to pre-process the data and, for instance,
select the subset of data to be mined. As in MSQL, data pre-processing is limited
to operations that can be expressed in SQL: it is not possible to sample data
before extraction, and the discretization must be done by the user. Notice
however that, by using the CLUSTER BY keyword, we can specify on which subgroups
of a group association rules must be found. Like MSQL, MINE RULE allows the user
to specify some constraints on rules to be extracted (on items belonging to head
or body, on their cardinality, as well as more complex constraints based on the
use of a taxonomy). The interested reader is invited to read [12,13] for an
illustration of these latter capabilities. Like MSQL, MINE RULE is essentially
designed around the extraction step, and it does not provide much support for the
other KDD steps (e.g., post-processing tasks must be done with SQL queries).
Finally, to our knowledge, MINE RULE is one of the few languages that have a well
defined semantics [13] for all its operations. Indeed, it is clear that a clean
theoretical background is a key issue to allow the generation of efficient
optimizers.

4.3 DMQL 

DMQL can work with traditional databases, so it can receive as input either 
the source relations Sales, Transactions and Customers shown in Figure 1 
or the view obtained by joining them and shown in Table 6. As already done 
with the examples on MINE RULE, let us consider that the view SalesView is 
given as input, so that the reader’s attention is more focused on the constraints 
that are strictly necessary for the mining task. 

Pre-processing Step 1: Selection of the Subset of Data to be Mined. 

Like MINE RULE, DMQL nests SQL for relational manipulations. So the selection 
of the relevant subset of data (i.e. clients buying products with their credit card) 
will be done via the use of the WHERE clause of the extraction query. 

Pre-processing Step 2: Encoding Age. DMQL does not provide primitives 
to encode data like MSQL. However, it allows us to define a hierarchy to specify 
ranges of values for customer’s age, as follows: 

define hierarchy age_hierarchy for customer_age on SalesView as
level1: {min...9} < level0: all
level1: {10...19} < level0: all                                        (12)
...
level1: {60...69} < level0: all
level1: {70...max} < level0: all

Rules Extraction over a Set of Items and Customers’ Age. DMQL allows 
the user to specify templates of rules to be discovered, called metapatterns, by 
using the matching keyword. These metapatterns can be used to impose strong 
syntactic constraints on rules to be discovered. So we can specify that we are look- 
ing for rule bodies relative to bought items and rule heads relative to customer’s 
age. Moreover, we can specify that we desire to use the predefined hierarchy for 
the age attribute. 

use database Sales_db
use hierarchy age_hierarchy for customer_age
mine association as SalesRB
matching with item+(X, {I}) ⇒ customer_age(X, A)                       (13)
from SalesView
where payment='credit_card'
group by t_id
with support threshold=25%
with confidence threshold=50%

where the above metarule with the notation {I} matches rules with a
repeated item predicate such as item(X, I1) ∧ item(X, I2) ∧ ... ∧ item(X, Ij),
where {I1, I2, ..., Ij} = I are different elements of the set I obtained as input
by the WHERE predicate clause. The result is shown in Table 7.

Table 7. Results produced by DMQL in the first rule extraction phase (SalesRB)

item+(X,{I})                               customer_age(X, A)          Support  Confidence
item(X,ski_pants)                          customer_age(X, 20...29)    37.5%    75%
item(X,hiking_boots)                       customer_age(X, 20...29)    50%      100%
item(X,brown_boots)                        customer_age(X, 20...29)    37.5%    66%
item(X,ski_pants) ∧ item(X,hiking_boots)   customer_age(X, 20...29)    37.5%    100%
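
The syntactic constraint imposed by the metapattern can be pictured as a simple
check on candidate rules. The following Python sketch uses a hypothetical
representation of rules as lists of (predicate, value) pairs and a hypothetical
function name; it does not describe how a DMQL engine matches metapatterns
internally.

```python
def matches_metapattern(body, head):
    """Check a rule against item+(X, {I}) => customer_age(X, A).

    body and head are lists of (predicate, value) pairs.
    """
    body_ok = (
        len(body) >= 1                                       # item+ : one or more
        and all(pred == "item" for pred, _ in body)
        and len({value for _, value in body}) == len(body)   # distinct items
    )
    head_ok = len(head) == 1 and head[0][0] == "customer_age"
    return body_ok and head_ok


rule_ok = matches_metapattern(
    [("item", "ski_pants"), ("item", "hiking_boots")],
    [("customer_age", "20...29")],
)
rule_bad = matches_metapattern(
    [("item", "brown_boots")],
    [("item", "col_shirts")],
)
```

The first rule has the shape of the rows of Table 7 and is accepted; the second
has an item predicate in the head and is rejected.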



Crossing-over: Looking for Exceptions in the Original Data. Like MINE 
RULE, DMQL does not provide support for crossing-over patterns and data: it 
requires SQL queries as already shown with MINE RULE (query (8)). 

Rules Extraction over Two Sets of Items. This phase is performed by the 
following DMQL statement: 

use database Sales_db
mine association as SalesRB2
matching with item+(X, {I}) ⇒ item+(X, {J})                            (14)
from Sales2
group by t_id
with confidence threshold=90%

Post-processing Step 1: Selection of the Rules with Two Items in the
Body. Like MINE RULE, DMQL does not provide support for operations of rule
manipulation. As we do not have direct access to the rules and thus do not know
their exact storage format, we assume that the rules are stored in the same way
as in MINE RULE, which allows us to compare the languages under the same
storage conditions. So, for this step, an SQL query similar to query (10) shown
in the examples of MINE RULE is needed.

Post-processing Step 2: Selection of the Rules with a Maximal Body.
Like MINE RULE, DMQL does not provide support for operations of rule
manipulation such as the selection of the most general rules. For the same reason
as in the previous post-processing step, an SQL query analogous to query (11) is
required.

Pros and Cons of DMQL. Like MINE RULE, one of the main advantages of DMQL
is that it completely nests classical SQL, and so it is quite easy for a new user
to learn and use the language. Moreover, DMQL is designed to work with
traditional databases and datacubes. Concerning the extraction step, DMQL allows
the user to impose strong syntactic constraints on patterns to be extracted, by
means of metapatterns that specify the form of extracted rules. Another advantage
of DMQL is that we can include some background knowledge in the process, by
defining hierarchies on items occurring in the database and mining rules across
different levels of hierarchies. Once rules are extracted, we can perform roll-up
and drill-down manipulations on them. Clearly, analogously to the other languages
studied so far, the main drawback of DMQL is that the language capabilities are
essentially centered around the extraction phase, and the language relies on SQL
or additional tools to perform pre- and post-processing operations. Finally, we
can notice that, beyond association rules, DMQL can perform other mining tasks,
such as classification.

4.4 OLE DB DM 

OLE DB DM is designed for a simple use of relational data already available via 
OLE DB. Thus, it can work with relational data. Creating a view is not necessary 
because putting the data in the right format is exactly one of the purposes of 
the definition and population of the mining model. 

Pre-processing Step 1: Selection of the Subset of Data to be Mined. In
the OLE DB DM framework, selection of data to be mined is done in the definition
of the data mining model and in the subsequent insertion of data into it.
Conceptually, it is very similar to the creation of a view. Here the mining model
is named [SalesRB] in analogy to the previous examples for the other languages.
Creation of the mining model: 

CREATE MINING MODEL [SalesRB] (
    [transaction_id] LONG KEY,
    [customer_age] LONG DISCRETIZED PREDICT,
    [items] TABLE (
        [item] TEXT KEY
    )
)
USING [My_assoc_Algo] (min_support=2, min_confidence=0.5)

Notice that we used a nested table [items] to specify the items bought by a
customer in a transaction, and made reference to a mining algorithm,
My_assoc_Algo, for the extraction of association rules.

Insertion of data in the data mining model: 

INSERT INTO [SalesRB]
    ([transaction_id], [customer_age], [items])
SHAPE
    {SELECT [transaction_id], [customer_age]
     FROM Transactions, Customers
     WHERE Transactions.customer=Customers.customer_id
     AND Transactions.payment="credit_card"}
APPEND (
    {SELECT [item] FROM Sales
     ORDER BY [tr_id]}
    RELATE [transaction_id] TO [tr_id])
AS [items]
Notice that selection of the interesting source data (purchases made by credit
card) is done in this step. Notice also that the APPEND keyword builds the nested
table [items] containing the items in source relation Sales purchased in a
transaction. The relationship between the transaction identifier in Sales and the
analogous identifier in the model is established by means of the RELATE keyword.

Pre-processing Step 2: Encoding Age. The definition of the data mining
model allows specification of discretized attributes and of the discretization
method used. However, the discretization itself must be provided by the data
mining algorithm provider.

Rules Extraction over a Set of Items and Customer’s Age. In SQL
Server 2000, no algorithm for association rule mining is currently available, but
the specification of OLE DB DM claims that association rule mining algorithms can
be supported. Here, we suppose that the user has implemented an association
rule mining algorithm, named My_assoc_Algo, which takes minimal support and
confidence as input parameters and refers to the content of the [items]
nested tables to derive association rules.

The results of the association rule mining process could be stored by the al- 
gorithm in a relational table and described by the following PMML representation. 

<Item id="1" value="ski_pants" />
<Item id="2" value="hiking_boots" />
<Item id="3" value="brown_boots" />
<Itemset id="1" support="0.5" numberOfItems="1">
    <ItemRef itemRef="1" />
</Itemset>
<Itemset id="2" support="0.5" numberOfItems="1">
    <ItemRef itemRef="2" />
</Itemset>
<Itemset id="3" support="0.375" numberOfItems="1">
    <ItemRef itemRef="3" />
</Itemset>
<Itemset id="4" support="0.375" numberOfItems="2">
    <ItemRef itemRef="1" />
    <ItemRef itemRef="2" />
</Itemset>
<Item id="4" value="[20,29]" />
<Itemset id="5" support="0.5" numberOfItems="1">
    <ItemRef itemRef="4" />
</Itemset>
<AssociationRule support="0.375" confidence="0.75"
    antecedent="1" consequent="5" />
<AssociationRule support="0.50" confidence="1.0"
    antecedent="2" consequent="5" />
<AssociationRule support="0.375" confidence="0.66"
    antecedent="3" consequent="5" />
<AssociationRule support="0.375" confidence="1.0"
    antecedent="4" consequent="5" />



Notice that such a PMML description is very similar to the rules storage struc- 
ture of MINE RULE. 
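
To make the correspondence with a rule table concrete, the following Python
sketch reads a PMML-like fragment of this kind and resolves each rule's
antecedent and consequent identifiers back to item values. It is only an
illustration: the wrapping AssociationModel element, the function name
read_rules and the reduced fragment are assumptions made here, not part of the
OLE DB DM specification or of any provider's actual output.

```python
import xml.etree.ElementTree as ET

# Reduced fragment in the style shown above; the enclosing element is an
# assumption added only so that the text parses as a single XML document.
PMML_FRAGMENT = """
<AssociationModel>
  <Item id="1" value="ski_pants" />
  <Item id="2" value="hiking_boots" />
  <Item id="4" value="[20,29]" />
  <Itemset id="4" support="0.375" numberOfItems="2">
    <ItemRef itemRef="1" />
    <ItemRef itemRef="2" />
  </Itemset>
  <Itemset id="5" support="0.5" numberOfItems="1">
    <ItemRef itemRef="4" />
  </Itemset>
  <AssociationRule support="0.375" confidence="1.0"
      antecedent="4" consequent="5" />
</AssociationModel>
"""


def read_rules(xml_text):
    """Return rules as (body items, head items, support, confidence)."""
    root = ET.fromstring(xml_text)
    # Item id -> item value.
    items = {e.get("id"): e.get("value") for e in root.findall("Item")}
    # Itemset id -> list of item values, resolved through ItemRef.
    itemsets = {
        e.get("id"): [items[r.get("itemRef")] for r in e.findall("ItemRef")]
        for e in root.findall("Itemset")
    }
    return [
        (
            itemsets[r.get("antecedent")],
            itemsets[r.get("consequent")],
            float(r.get("support")),
            float(r.get("confidence")),
        )
        for r in root.findall("AssociationRule")
    ]


for body, head, support, confidence in read_rules(PMML_FRAGMENT):
    print(body, "=>", head, support, confidence)
```

Once rules are flattened into such tuples, the SQL-style post-processing
discussed for MINE RULE can be applied to them in the same way.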

Crossing-over: Looking for Exceptions in the Original Data. For this
task, we must write a query in classical SQL. Since the association rules produced
by the algorithm could be stored in the PMML format, which is quite close to the
storage format of MINE RULE, we can say that the query will be very similar to
the one used with MINE RULE.

Concerning post-processing tasks, or the usage of the rules after their proper 
extraction, notice that OLE DB DM only provides some facilities for prediction, 
with PREDICTION JOIN. However, this is not useful here. 

Rules Extraction over Two Sets of Items. We want to perform a new
mining task here, so we must define a new mining model. This one is analogous
to the model used in the previous step, with the exception of the customers’ age,
which is not needed in this case. Indeed, the difference between this mining task
and the previous one lies in the execution of the mining algorithm, which
associates an itemset with another itemset and not with the customers’ age. For
the sake of space we do not report this new model here.

Post-processing Step 1: Manipulation of Rules. Here again, we need to 
access rules’ components. Since the OLE DB DM suggests that bodies and heads 
of rules are stored following the PMML format, the query will be very similar to 
the one used with MINE RULE. 

Post-processing Step 2: Selection of Rules with a Maximal Body. Again, 
since the rules could be stored following the PMML format, we can use the same 
kind of queries used for MINE RULE. 

Pros and Cons of OLE DB for DM. The first advantage of OLE DB DM is that it
is a first attempt at an industrial standard and that it is beginning to be
implemented in some commercial applications (like SQL Server 2000). It is
designed as an extension to SQL, and so a DBA can write queries that are similar
to classical SQL queries and that define and populate data mining models. The
main problem, however, is that the language of OLE DB DM is not really a language
for data mining like the other three. It is particularly targeted at making the
communication between relational databases and data mining algorithms easier. So
it can work with a lot of different algorithms, provided that the algorithms are
compliant with the OLE DB DM mining model. However, it provides no facilities to
handle typical constraints of the association rule mining problem, such as
constraints on items, frequency and confidence. More generally, all these types
of constraints must be given as parameters to the mining algorithm. Moreover,
accessing the mining results and browsing the extracted patterns must be managed
by the algorithm provider, which makes a general method for post-processing
difficult to define. Finally, there is no formal semantics as there is in MINE RULE.



5 Conclusions 

We have considered three languages, MSQL, MINE RULE and DMQL and an API for 
data mining, OLE DB DM, with an SQL-like language for the deployment of a data 
mining model. All of them request the extraction from a relational database of 
data mining patterns, and in particular of association rules. They satisfy the 
“closure property”, a crucial property for inductive databases. We have com- 
pared the various features of these languages with the desired properties of an 
ideal query language for inductive databases dedicated to association rules. We 
have prepared a benchmark and tested the languages against it. The benchmark
consists of a hypothetical KDD scenario, taken from data mining practice, in
which we have formulated a collection of queries. We have tested the
possibility and the ease for the user to express the chosen queries in the above
mentioned languages. The outcome is that no language presents all the desired
properties. MSQL seems to be the one that offers the largest number of primitives
tailored for post-processing, together with an on-the-fly encoding specifically
designed for efficiency. DMQL allows the extraction of different data patterns,
the definition and use of hierarchies, and some visualization primitives.
MINE RULE is the only one that allows the user to dynamically partition the
source relation into a first and a second level grouping (the clusters), to which
more sophisticated rule constraints can be applied. Furthermore, to the best of
our knowledge, it appears to be the only language having an algebraic semantics,
an important factor for an in-depth study of optimization issues. OLE DB DM is
an API that allows any application to access a relational data source by means
of SQL-like queries, and to be coupled with specialized mining algorithms. The
main motivation of the design of OLE DB DM is to ease the communication between
a data mining application, the DBMS providing data and a set of available,
advanced data mining algorithms. However, at the moment, it does not provide any
specific feature tailored to any particular data mining task that is not
predictive.

However, it is clear that one of the main limits of all the proposed languages 
is the weak support of rule post-processing. In particular, in all the languages 
post-processing capabilities are limited to a few predefined built-in primitives. 
Instead, it would be desirable for the grammar of the languages to accept
a certain degree of extensibility. For instance, it is currently not possible to
introduce user-defined functions in the statements. Such functions would allow
the user to provide the implementation of sophisticated user-defined constraints,
based, for instance, on new pattern evaluation measures.
Furthermore, the research on condensed representations for frequent itemsets
[2,3] has proved useful not only for mining frequent itemsets and frequent
association rules from dense databases but also for sophisticated post-processing
[1,15]. Indeed, one of the problems in association rule mining from real-life data
is the huge number of extracted rules. However, many of the rules are redundant
and might be useless. Thus, a condensed representation would help in visualizing
the result and focusing the user's attention on the relevant rules. For example,
Bastide et al. [1] present an algorithm to extract a minimal cover of the set of
frequent association rules.

Another crucial issue relative to query languages for data mining is the
optimization of sequences of queries (e.g., deciding query containment). To the
best of our knowledge, the materialization of condensed representations of the
frequent itemsets seems to be quite useful [9,4] but still needs further work.

Last but not least, an important issue is the simplicity of the language and 
its ease of use. Indeed, we think that a good candidate language for data mining 
should be flexible enough to specify a variety of different mining tasks in a 
declarative fashion. To the best of our knowledge, the implementation of these 
languages tackles the mentioned problems (including the lack of instruments 
dedicated to post-processing) by being embedded in a data mining system, which 
provides a graphical front end to the language. 
Abstract. Researchers convincingly argue that the ability to declaratively
mine and analyze relational databases using SQL for decision
support is a critical requirement for the success of the acclaimed data
mining technology. Although there have been several encouraging attempts
at developing methods for data mining using SQL, simplicity
and efficiency still remain significant impediments to further development.
In this article, we propose a significantly new approach and show
that any object relational database can be mined for association rules
without any restructuring or preprocessing using only basic SQL3 constructs
and functions, and hence no additional machinery is necessary.
In particular, we show that the cost of computing association rules for a 
given database does not depend on support and confidence thresholds. 
More precisely, the set of large items can be computed using one simple 
join query and an aggregation once the set of all possible meets (least 
fixpoint) of item set patterns in the input table is known. We believe that 
this is an encouraging discovery especially compared to the well known 
SQL based methods in the literature. Finally, we capture the function- 
ality of our proposed mining method in a mine by SQL3 operator for 
general use in any relational database. 



1 Introduction 

In recent years, mining association rules has been a popular way of discovering 
hidden knowledge from large databases. Most efforts have focused on developing 
novel algorithms and data structures to aid efficient computation of such rules. 
Despite major efforts, the complexity of the best known methods remains high.
While several efficient algorithms have been reported [1,4,10,22,19,12,21,18,23], 
overall efficiency continues to be a major issue. In particular, in paradigms other 
than association rules such as ratio rules [11], chi square method [3], and so on, 
efficiency remains one of the biggest challenges. 

The motivation, importance, and the need for integrating data mining with
relational databases have been addressed in several articles such as [16,17]. They
convincingly argue that without such integration, data mining technology may 
not find itself in a viable position in the years to come. To be a successful 
and feasible tool for the analysis of business data in relational databases, such 
technology must be made available as part of database engines as well as part 
of its declarative query language. 
While research into procedural computation of association rules has been
extensive, fewer attempts have been made to use relational machinery or SQL 
for declarative rule discovery barring a few exceptions such as [13,25,21,8,20,14]. 
Most of these works follow an apriori like approach by mimicking its function- 
ality and rely on generating candidate sets and consequently suffer from high 
computational overhead. While it is certainly possible to adapt any of the various 
procedural algorithms for rule mining as a special mining operator, the opportu- 
nity for using existing technology and constructs is preferable if it proves to be 
more beneficial. Some of the benefits of using existing relational machinery may 
include opportunity for query optimization, declarative language support, selec- 
tive mining, mining from non-transactional databases, and so on. From these 
standpoints, it appears that research into data mining using SQL or SQL-like
languages bears merit and warrants attention. But before we proceed any further,
we would like to briefly summarize the concept of association rules as follows for 
the readers unfamiliar with the subject. 

Let $\mathcal{I} = \{i_1, i_2, \ldots, i_m\}$ be a set of item identifiers. Let
$T$ be a transaction table such that every tuple in $T$ is a pair, called the
transaction, of the form $\langle tid, X \rangle$ such that $tid$ is a unique
transaction ID and $X \subseteq \mathcal{I}$ is a set of item identifiers (or
items). A transaction is usually identified by its transaction ID $tid$, and said
to contain the item set $X$. An association rule is an implication of the form
$X \rightarrow Y$, where $X, Y \subseteq \mathcal{I}$, and $X \cap Y = \emptyset$.
Association rules are assigned a support (written as $\delta$) and confidence
(written as $\eta$) measure, and denoted $X \rightarrow Y\langle\delta, \eta\rangle$.
The rule $X \rightarrow Y$ has a support $\delta$, denoted $sup(X \rightarrow Y)$,
in the transaction table $T$ if $\delta$% of the transactions in $T$ contain
$X \cup Y$. In other words,

$sup(X \rightarrow Y) = sup(X \cup Y) = \delta = \frac{|\{\langle tid, I \rangle \in T \mid X \cup Y \subseteq I\}|}{|T|}$,

where $I \subseteq \mathcal{I}$ is a set of items. On the other hand, the rule
$X \rightarrow Y$ is said to have a confidence $\eta$, denoted
$con(X \rightarrow Y)$, in the transaction table $T$ if $\eta$% of the
transactions in $T$ that contain $X$ also contain $Y$. So, the confidence of a
rule is given by

$con(X \rightarrow Y) = \eta = \frac{sup(X \cup Y)}{sup(X)}$.

Given a transaction table $T$, the problem of mining association rules is to
generate a set of quadruples $\mathcal{R}$ (a table) of the form
$\langle X, Y, \delta, \eta \rangle$ such that $X, Y \subseteq \mathcal{I}$,
$X \cap Y = \emptyset$, $\delta > \delta_m$, and $\eta > \eta_m$, where
$\delta_m$ and $\eta_m$ are user supplied minimum support and confidence
thresholds, respectively. The clarity of the definitions and the simplicity of
the problem are actually deceptive. As mentioned before, to be able to compute
the rules $\mathcal{R}$, we must first compute the frequent item sets. A set of
items $X$ is called a frequent item set if its support $\delta$ is greater than
the minimum support $\delta_m$.
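
The definitions of support, confidence and frequent item set can be checked on
a small example. The Python sketch below is only an illustration: the
seven-transaction table and the names sup, con and is_frequent are assumptions
introduced here, and support is returned as a fraction rather than a percentage.

```python
from fractions import Fraction

# Illustrative transaction table: transaction ID -> item set (assumed data).
T = {
    "t1": {"a", "b", "c"},
    "t2": {"b", "c", "f"},
    "t3": {"b", "f"},
    "t4": {"a", "b", "c"},
    "t5": {"b", "e"},
    "t6": {"d", "f"},
    "t7": {"d"},
}


def sup(itemset, table):
    """Fraction of the transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for items in table.values() if itemset <= items)
    return Fraction(hits, len(table))


def con(x, y, table):
    """Confidence of the rule X -> Y, i.e. sup(X u Y) / sup(X)."""
    return sup(set(x) | set(y), table) / sup(x, table)


def is_frequent(itemset, table, delta_m):
    """An item set is frequent if its support exceeds the threshold delta_m."""
    return sup(itemset, table) > delta_m


print(sup({"b", "c"}, T))     # support of the rule b -> c
print(con({"b"}, {"c"}, T))   # confidence of the rule b -> c
```

Exact fractions are used so that comparisons against a threshold are not
affected by floating point rounding.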



1.1 Related Research 

Declarative computation of association rules was investigated in works such as
[13,25,9,21,8,20,14,7,5]. Meo et al. [14] propose an SQL-like declarative query
language for association rule mining. The language proposed appears to be too
oriented towards transaction databases, and may not be suitable for general
association rule mining. It is worth mentioning that association rules may be
computed for virtually any type of database, transaction or not. In their
extended language, they blend a rule mine operator with SQL and other additional
features. The series of research reported in [25,21,20], led by IBM researchers,
mostly addressed the mining issue itself. They attempted to compute the large
item sets by generating candidate sets and testing for their admissibility based
on their MC model, combination, and GatherJoin operators. Essentially, these works
proposed a method for implementing apriori using SQL. In our opinion, by trying
to faithfully copy a procedural concept into a declarative representation they
retain the drawbacks and inefficiencies of apriori in the model.

The mine rule operator proposed in [13] is perhaps the closest idea to ours. The
operator has significant strengths in terms of expressive power. But it also
requires a whole suite of new algebraic operators. These operators basically
simulate the counting process using a set of predefined functions such as
CountAllGroups, MakeClusterPairs, ExtractBodies, and ExtractRules. These functions
use a fairly large number of new operators proposed by the authors, some of which
use looping constructs. Unfortunately, no optimization techniques for these
operators are available, which, in our opinion, raises doubts about the
computational viability of this approach.

In this article, we will demonstrate that there is a simpler SQL3 expression 
for association rule mining that does not require candidate generation such as in 
[25,21,20] or any implementation of new specialized operators such as in [13,14]. 
We also show that we can simply add an operator similar to the cube by operator
proposed for data warehousing applications, with an optional having clause to
facilitate filtering of unwanted derivations. The striking feature of our proposal is
that we can exploit the vast array of optimization techniques that already exists 
and possibly develop newer ones for better performance. These are some of the 
advantages of our proposal over previous research in addition to its simplicity 
and intuitive appeal. 

1.2 Contributions of this Article and Plan for the Presentation 

We summarize the contributions of this article as follows to give the reader 
an idea in advance. We present a different view of transaction databases and 
identify several properties that they satisfy in general in section 2. We exploit 
these properties to develop a purely SQL3 based solution for association rule
mining that uses the idea of least fixpoint computation. We rely on the SQL3
standard as it supports complex structures such as sets and complex operations
such as nesting and unnesting. We also anticipate the availability of several set
processing functions such as intersect ($\cap$) and setminus ($\setminus$), set
relational operators such as subset ($\subset$) and superset ($\supset$), nested
relational operations such as nest by, etc. Finally, we also exploit SQL3’s
create view recursive and with constructs to implement our least fixpoint
operator for association rule mining.

We follow the tradition of separating the large item set counting from actual 
mining and propose two operators - one to compute the large item sets from 
a source table, another one to compute the rules from the large item sets. We 
provide optional mechanisms to specify support and confidence thresholds and a 
few additional constraints that the user may wish the mining process to satisfy.
We also define a single operator version of the mine by operator to demonstrate
that it is possible to do so even within our current framework, even though we
prefer the two stage approach.

The other implicit contribution of our proposal is that it opens up the op- 
portunity for query optimization, something that was not practically possible 
until now in mining applications. Finally, it is now possible to use any relational 
database for mining in which one need not satisfy input restrictions similar to 
the ones that various mining algorithms require. Consequently, the developments
in this article eliminate the need for any traditional preprocessing of input data.

In section 3, we present a discussion on a set theoretic perspective of data 
mining problem. This discussion builds upon the general properties of trans- 
action tables presented in section 2. In this section, we demonstrate through 
illustrative examples that we can solve the mining problem just using set, lattice 
and aggregate operations if we adopt the idea of the so called non-redundant 
large item sets. Once the problem is understood on intuitive grounds, the rest of 
the development follows in fairly straightforward ways. In section 4 we present 
a series of SQL3 expressions that capture the spirit of the procedure presented 
in section 3. One can also verify that these expressions really produce the so- 
lution we develop in the illustrative example in this section. We then discuss 
the key idea we have exploited in developing the solution in section 5. The min- 
ing operator is presented in section 6 that is an abstraction of the series of 
SQL3 expressions in section 4. Several optimization opportunities and related 
details are discussed in section 7. Before we conclude in section 9, we present 
a comparative analysis of our method with other representative proposals in 
section 8. 



2 Properties of Transaction Tables 

In this section, we identify some of the basic properties shared by all transac- 
tion tables. We explain these properties using a synthetic transaction table in 
relational data model [24]. In the next section, we will introduce the relational 
solution to the association rule mining problem. 

Let $\mathcal{I}$ be a set of items, $\mathcal{P}(\mathcal{I})$ be all possible
item sets, $\mathcal{T}$ be a set of identifiers, and $\delta_m$ be a threshold.
Then an item set table $S$ with scheme {Tid, Items} is given by

$S \subseteq \mathcal{T} \times \mathcal{P}(\mathcal{I})$

such that $m = |S|$. An item set table $S$ is admissible if for every tuple
$t \in S$, and for every subset $s \in \mathcal{P}(t[Items])$, there exists a
tuple $t' \in S$ such that $s = t'[Items]$. In other words, every possible subset
of items in a tuple is also a member of $S$. The frequency table of an admissible
item set table $S$ can be obtained as

$F = {}_{Items}\mathcal{G}_{count(*)}(S)$

which has the scheme {Items, Count}. A frequent item set table $F_f$ is a set of
tuples that satisfies the count threshold property as follows.

$F_f = \sigma_{\frac{Count}{m} > \delta_m}(F)$

$F_f$ satisfies some additional interesting properties. Suppose $I = t[Items]$ is
an item set for any tuple $t \in F_f$. Then, for any $X, Y \subset I$, there exist
$t_1$ and $t_2$ in $F_f$ such that $t_1[Items] = X$, $t_2[Items] = Y$,
$t_1[Count] \geq t[Count]$, $t_2[Count] \geq t[Count]$, and
$t[Count] \leq \min(t_1[Count], t_2[Count])$. The converse, however, is
not true. That is, for any two tuples $t_1$ and $t_2$ in $F_f$, it is not
necessarily true that there exists a tuple $t \in F_f$ such that
$t[Items] = t_1[Items] \cup t_2[Items]$. But if such a $t$ exists then the
relation $t[Count] \leq \min(t_1[Count], t_2[Count])$ is always true. Such a
relationship is called anti-transitive.

The goal of the first stage of apriori-like algorithms has been to generate the
frequent item set table described above from a transaction table T. Note that
a transaction table, as defined above, is, in reality, not admissible. But the
first stage of apriori mimics admissibility by constructing the k-item sets at
every kth iteration step.
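
As an illustration of admissibility, the following Python sketch expands each
transaction into all of its non-empty subsets, groups on the item set and
counts, which mirrors the frequency table and the count threshold selection
described above. The data, the function names, the exclusion of the empty set,
and the use of the number of transactions as m are assumptions of this sketch.
The explicit subset enumeration is exactly the overhead that the rest of the
article tries to avoid.

```python
from collections import Counter
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of `items` as a frozenset."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def frequency_table(transactions):
    """Group the admissible (subset-expanded) table on Items and count."""
    counts = Counter()
    for items in transactions:
        counts.update(nonempty_subsets(items))
    return counts


def frequent_table(counts, m, delta_m):
    """Keep the item sets whose relative count exceeds the threshold."""
    return {s: c for s, c in counts.items() if c / m > delta_m}


# Illustrative transactions (assumed data).
transactions = [
    {"a", "b", "c"}, {"b", "c", "f"}, {"b", "f"}, {"a", "b", "c"},
    {"b", "e"}, {"d", "f"}, {"d"},
]
F = frequency_table(transactions)
Ff = frequent_table(F, m=len(transactions), delta_m=0.25)
for itemset, count in sorted(Ff.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```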

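Once the frequent item set table is available, the association rule table
$\mathcal{R}$ can be computed as¹

$\mathcal{R} = \Pi_{F_{f_1}.Items,\; F_{f_2}.Items \setminus F_{f_1}.Items,\; \frac{F_{f_2}.Count}{m},\; \frac{F_{f_2}.Count}{F_{f_1}.Count}} \left( \sigma_{F_{f_1}.Items \subset F_{f_2}.Items} (F_{f_1} \times F_{f_2}) \right)$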
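
Read operationally, the expression pairs every frequent item set with each of
its frequent proper supersets and projects body, head, support and confidence.
A minimal Python sketch of that reading follows; the dictionary ff, the value
of m and the function name rules_table are assumptions made for illustration.

```python
from fractions import Fraction


def rules_table(ff, m):
    """Pair each frequent item set with every frequent proper superset.

    ff maps frozenset item sets to counts; m is the number of transactions.
    Each result row is (body, head, support, confidence).
    """
    rules = []
    for body, body_count in ff.items():
        for full, full_count in ff.items():
            if body < full:  # proper subset test, as in the selection
                rules.append((
                    body,
                    full - body,
                    Fraction(full_count, m),
                    Fraction(full_count, body_count),
                ))
    return rules


# Hypothetical frequent item set table with counts.
ff = {frozenset("b"): 5, frozenset("c"): 3, frozenset("bc"): 3}
for body, head, support, confidence in rules_table(ff, m=7):
    print(sorted(body), "->", sorted(head), support, confidence)
```

The nested loop plays the role of the product of the two copies of the
frequent item set table.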



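This expression, however, produces all possible rules, some of which are even
redundant. For example, let
$a \rightarrow b\langle \frac{s_{ab}}{m}, \frac{s_{ab}}{s_a} \rangle$ and
$ab \rightarrow c\langle \frac{s_{abc}}{m}, \frac{s_{abc}}{s_{ab}} \rangle$
be two rules discovered from $F_f$, where $s_X$ and $m$ represent respectively
the frequency of an item set $X$ in the item set table (i.e., $t \in F_f$, and
$t[Count] = s_X$), and the number of transactions in the item set table $S$.
Then it is also the case that $\mathcal{R}$ contains another rule (transitive
implication)
$a \rightarrow bc\langle \frac{s_{abc}}{m}, \frac{s_{abc}}{s_a} \rangle$.
Notice that this last rule is a logical consequence of the first two rules that
can be derived using the following inference rule, where
$X, Y, Z \subseteq \mathcal{I}$ are item sets.

$\frac{X \rightarrow Y\langle \frac{s_{X \cup Y}}{m}, \frac{s_{X \cup Y}}{s_X} \rangle \qquad X \cup Y \rightarrow Z\langle \frac{s_{X \cup Y \cup Z}}{m}, \frac{s_{X \cup Y \cup Z}}{s_{X \cup Y}} \rangle}{X \rightarrow Y \cup Z\langle \frac{s_{X \cup Y \cup Z}}{m}, \frac{s_{X \cup Y \cup Z}}{s_X} \rangle}$

Written differently, using only symbols for support ($\delta$) and confidence
($\eta$), the inference rule reads as follows.

$\frac{X \rightarrow Y\langle \delta_1, \eta_1 \rangle \qquad X \cup Y \rightarrow Z\langle \delta_2, \eta_2 \rangle}{X \rightarrow Y \cup Z\langle \delta_2, \eta_1 * \eta_2 \rangle}$

Formally, let $X, Y, Z \subseteq \mathcal{I}$ be sets of items. If
$X \rightarrow Y\langle \delta_1, \eta_1 \rangle$,
$X \cup Y \rightarrow Z\langle \delta_2, \eta_2 \rangle$ and
$X \rightarrow Y \cup Z\langle \delta_3, \eta_3 \rangle$ hold, then we say that
$X \rightarrow Y \cup Z\langle \delta_3, \eta_3 \rangle$ is an anti-transitive
rule. Also, if $r = X \rightarrow Y\langle \delta, \eta \rangle$ is a rule, and
$\delta_m$ and $\eta_m$ are respectively the minimum support and confidence
requirements, then $r$ is redundant if it is derivable from other rules, or if
$\delta < \delta_m$ or $\eta < \eta_m$.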



¹ Assuming that two copies of $F_f$ are available as $F_{f_1}$ and $F_{f_2}$.
It is possible to show that for any minimum support and confidence thresholds
$\delta_m$ and $\eta_m$ respectively, if the rules
$X \rightarrow Y\langle \delta_1, \eta_1 \rangle$ and
$X \cup Y \rightarrow Z\langle \delta_2, \eta_2 \rangle$ hold, then the rule
$X \rightarrow Y \cup Z\langle \delta_3, \eta_3 \rangle$ also holds such that
$\delta_3 = \delta_2 > \delta_m$, and
$\eta_3 = \eta_1 * \eta_2 \leq \min(\eta_2, \eta_1)$. Notice that
$\eta_3 = \eta_1 * \eta_2$ could be less than the confidence threshold $\eta_m$,
even though $\eta_1 > \eta_m$ and $\eta_2 > \eta_m$. In other words,
$\eta_1 > \eta_m \wedge \eta_2 > \eta_m \not\Rightarrow \eta_1 * \eta_2 > \eta_m$.
Furthermore, we can show that any rule
$r = X \rightarrow Y\langle \delta, \eta \rangle$ is redundant if it is
anti-transitive. It is interesting to observe that the redundancy of rules is a
side effect of the redundancy of large item sets. Intuitively, for any given pair
of large item sets $l_1$ and $l_2$, $l_1$ is redundant if $l_1 \subset l_2$, and
the support $s_{l_1}$ of $l_1$ is equal to the support $s_{l_2}$ of $l_2$.
Otherwise, it is non-redundant. Intuitively, $l_1$ is redundant because its
support $s_{l_1}$ can be computed from $s_{l_2}$ just by copying. A more formal
treatment of the concept of large itemsets and redundant large itemsets may be
found in section 5.1.
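
The arithmetic behind the inference rule can be verified on concrete counts.
The Python sketch below uses hypothetical support counts for the item sets b,
bc and abc over seven transactions and a hypothetical confidence threshold of
1/2; the names s, m, rule and eta_m are assumptions of this illustration.

```python
from fractions import Fraction

# Hypothetical support counts over m = 7 transactions.
m = 7
s = {frozenset("b"): 5, frozenset("bc"): 3, frozenset("abc"): 2}


def rule(x, y):
    """Return (support, confidence) of the rule X -> Y from the counts in s."""
    x = frozenset(x)
    xy = x | frozenset(y)
    return Fraction(s[xy], m), Fraction(s[xy], s[x])


d1, e1 = rule("b", "c")    # b  -> c
d2, e2 = rule("bc", "a")   # bc -> a
d3, e3 = rule("b", "ac")   # b  -> ac, derivable from the two rules above

eta_m = Fraction(1, 2)     # hypothetical confidence threshold

print(d3 == d2)             # the support of the derived rule equals delta_2
print(e3 == e1 * e2)        # its confidence is the product eta_1 * eta_2
print(e1 > eta_m, e2 > eta_m, e3 > eta_m)
```

With these counts both premises exceed the threshold while the derived rule
does not, which is the situation described in the text.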

Since in the frequent item set table, every item set is a member of a chain that
differs by only one element, the following modification for $\mathcal{R}$ will
compute the rules that satisfy the given support and confidence thresholds and
avoid generating all such redundant rules².

$\mathcal{R} = \sigma_{Conf > \eta_m} \left( \rho_{r(Ant, Cons, Sup, Conf)} \left( \Pi_{F_{f_1}.Items,\; F_{f_2}.Items \setminus F_{f_1}.Items,\; \frac{F_{f_2}.Count}{m},\; \frac{F_{f_2}.Count}{F_{f_1}.Count}} \left( \sigma_{F_{f_1}.Items \subset F_{f_2}.Items \,\wedge\, (|F_{f_2}.Items| - |F_{f_1}.Items| = 1)} (F_{f_1} \times F_{f_2}) \right) \right) \right)$
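
A minimal Python sketch of this restricted computation is given below. It keeps
only pairs of frequent item sets that differ by exactly one element and then
filters on confidence. The table ff, the threshold eta_m and the function name
chain_rules are assumptions made for the illustration.

```python
from fractions import Fraction


def chain_rules(ff, m, eta_m):
    """Rules between frequent item sets that differ by exactly one item.

    ff maps frozenset item sets to counts, m is the number of transactions
    and eta_m is the confidence threshold. Rows are (Ant, Cons, Sup, Conf).
    """
    rules = []
    for body, body_count in ff.items():
        for full, full_count in ff.items():
            if body < full and len(full) - len(body) == 1:
                conf = Fraction(full_count, body_count)
                if conf > eta_m:
                    rules.append((body, full - body,
                                  Fraction(full_count, m), conf))
    return rules


# Hypothetical frequent item set table with counts.
ff = {frozenset("b"): 5, frozenset("bc"): 3, frozenset("abc"): 2}
for ant, cons, support, conf in chain_rules(ff, m=7, eta_m=Fraction(1, 2)):
    print(sorted(ant), "->", sorted(cons), support, conf)
```

On this input the pair (b, abc) is skipped because the two item sets differ by
two elements, so the derivable rule from b to ac is never generated.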



2.1 The Challenge 

The preceding discussion was aimed to demonstrate that a relational computa- 
tion of association rules is possible. However, we used an explicit generation of 
the power set of the items in X to be able to compute the frequent item set table 
Ff from the item set table S. This is a huge space overhead, and consequently, 
imposes a substantial computational burden on the method. Furthermore, we 
required that the item set table S be admissible, another significant restriction 
on the input transaction table. These are just some of the difficulties faced when 
a set theoretic or relational characterization of data mining is considered. The 
procedurality involved acts as a major bottleneck. So, the challenge we undertake
is to admit any arbitrary transaction table, yet be able to compute the
association rules “without” explicit generation of candidate item sets from a
relational database, and compute the relation $\mathcal{R}$ as introduced before
using existing SQL3 constructs and machinery.

3 A Set Theoretic Perspective of Data Mining 

In this section, we present our idea of a SQL mine operator on intuitive grounds 
using a detailed example. The expectation here is that once we intuitively under- 
stand the issues related to the operator, it should be relatively easier to follow 



² $\rho$ is a relation renaming operator defined in [24].
the technical developments in the later sections. Also, this simple explanation
will serve as the basis for a more general relational mining operator we plan to
present at the end of this article.

Consider a database, called the transaction table, T as shown in figure 1.
Following the traditional understanding of association rule mining, and also from
the discussion in section 2, from the source table T we expect to obtain the large
item set table (l_table) and the rules table (r_table) shown in figure 1 below once
we set the support threshold at 25%. The reasoning process for arriving at the
large item set and rules tables can be explained as follows.



t_table (transaction table)

Tranid  Items
t1      a
t1      b
t1      c
t2      b
t2      c
t2      f
t3      b
t3      f
t4      a
t4      b
t4      c
t5      b
t5      e
t6      d
t6      f
t7      d

l_table (large item set table)

Items      Support
{a, b, c}  .29
{b, f}     .29
{b, c}     .38
{f}        .43
{d}        .29
{b}        .71

r_table (association rules table)

Ant     Cons  Support  Conf
{b}     {c}   0.38     0.60
{f}     {b}   0.29     0.66
{b}     {f}   0.29     0.40
{b, c}  {a}   0.29     0.66

Fig. 1. Source transaction database T is shown as t_table, large item set table
as l_table, and finally the association rules as r_table

We can think of T as the set of complex tuples shown in the nested table (n_table)
in figure 2 once we nest the items on transaction numbers. If we use a group by
on the Items column and count the transactions, we will compute the frequency
table (f_table) in figure 2 that will show how many times a single item set pattern
appears in the transaction table (t_table) in figure 1. Then, let us assume that
we took a cross product of the frequency table with itself, and selected the rows
for which

• the Items column in the first table is a proper subset of the Items column
in the second table, and finally projected out the Items column of the first
table and Support column of the second table³, or

• the Items columns are not subsets of one another, and we took the intersection
of the Items of both the tables, created a new relation (intTable, called the
intersection table) with distinct tuples of such Items with Support 0, and
then finally computed the support counts as explained in step 1 now with
the frequency table and intersection table⁴.

³ This will give us ⟨{b, f}, 1⟩ and ⟨{d}, 1⟩.

n_table (nested table)

Tranid  Items
t1      {a, b, c}
t2      {b, c, f}
t3      {b, f}
t4      {a, b, c}
t5      {b, e}
t6      {d, f}
t7      {d}

f_table (frequency table)

Items      Support
{a, b, c}  2
{b, c, f}  1
{b, e}     1
{b, f}     1
{d}        1
{d, f}     1

i_table (inheritance table)

Items   Support
{b, c}  3
{b, f}  1
{b}     5
{f}     3
{d}     1

c_table (count table)

Items      Support
{a, b, c}  2
{b, c, f}  1
{b, c}     3
{b, f}     2
{b, e}     1
{d, f}     1
{b}        5
{f}        3
{d}        2

Fig. 2. n_table: t_table after nesting on Tranid, f_table: n_table after grouping
on Items and counting, i_table: generated from f_table, and c_table: grouping on
Items and sum on Support on the union of i_table and f_table

This will give us the inheritance table (i_table) shown in figure 2. Finally, 
if we take a union of the frequency table and the inheritance table, and then 
group by the Items column and sum the Support column, we obtain 
the count table (c_table) of figure 2. 
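
The walk-through above can be mirrored in a few lines of Python. This is only a 
sketch under stated assumptions: the rows are those of t_table in figure 1, item 
sets are modelled as frozensets, empty intersections are discarded, and the 
helper names (nest, frequency, intersections, inheritance, count_table) are 
illustrative and not part of the proposal. It performs the single round of 
intersections described in this section. 

```python
from collections import Counter
from itertools import combinations

# Rows of t_table (figure 1) as (Tranid, Item) pairs.
T_TABLE = [
    ("t1", "a"), ("t1", "b"), ("t1", "c"),
    ("t2", "b"), ("t2", "c"), ("t2", "f"),
    ("t3", "b"), ("t3", "f"),
    ("t4", "a"), ("t4", "b"), ("t4", "c"),
    ("t5", "b"), ("t5", "e"),
    ("t6", "d"), ("t6", "f"),
    ("t7", "d"),
]


def nest(t_table):
    """n_table: collect the items of every transaction into one set."""
    nested = {}
    for tranid, item in t_table:
        nested.setdefault(tranid, set()).add(item)
    return {tranid: frozenset(items) for tranid, items in nested.items()}


def frequency(n_table):
    """f_table: group by item set and count the transactions."""
    return Counter(n_table.values())


def intersections(f_table):
    """Intersection table: distinct, non-empty intersections of item sets
    that are not subsets of one another and are absent from f_table.
    Dropping empty intersections is an assumption of this sketch."""
    found = set()
    for u, v in combinations(f_table, 2):
        if not (u <= v or v <= u):
            common = u & v
            if common and common not in f_table:
                found.add(common)
    return found


def inheritance(f_table, int_table):
    """i_table: every node inherits the Support of each proper superset
    that occurs in f_table (steps 1 and 2 of the text)."""
    inherited = Counter()
    for node in set(f_table) | set(int_table):
        for items, support in f_table.items():
            if node < items:
                inherited[node] += support
    return inherited


def count_table(f_table, i_table):
    """c_table: union of f_table and i_table, summed per item set."""
    return f_table + i_table


f_table = frequency(nest(T_TABLE))
i_table = inheritance(f_table, intersections(f_table))
c_table = count_table(f_table, i_table)
for items, support in sorted(c_table.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(items)), support)
```

Keeping the counts in a Counter means the final union and sum is plain counter 
addition, which keeps duplicates the way a bag union would. 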

The entire process of large item set and association rule generation can be 
conveniently explained using the so-called item set lattice found in the 
literature, once we enhance it with some additional information. Intuitively, 
consider placing each item set u appearing in the frequency table, together with 
its support count t, as a node in the lattice L as shown in figure 3. Notice that 
in the lattice, each node is represented as u_t^c, which denotes the fact that u 
appears in exactly t transactions in the source table, and that u also appears as 
a proper subset of other transactions n times, such that c = n + t. t is called 



^2 The result of this will be the tuples <{b, c}, 3>, <{b}, 5>, and <{f}, 3> in 
this example. Note that the intersection table will contain the tuples 
<{b, c}, 0>, <{b}, 0>, and <{f}, 0>, and that these patterns are not part of the 
frequency table in figure 2. The union of step 1 and step 2, processed with the 
intersection table, will now produce the inheritance table in figure 2. 

Fig. 3. Lattice representation of item set database T 



the transaction count, or frequency count, and c is called the total count of item 
set u. The elements or nodes in L also satisfy additional interesting properties. 
A node v at level l differs from its child u at level l - 1 by exactly one element, 
and u ⊂ v. For any two children u and w of a node v, v = u ∪ w. For any 
two nodes^3 u and v at any level l, with total counts c_u and c_v, their join is 
defined as u ∪ v with total count c_j, and their meet as u ∩ v with total count 
c_m, such that c_j ≤ min(c_u, c_v) and c_m ≥ max(c_u, c_v). 

Note that in figure 3, the nodes marked with a solid rectangle are the nodes 
(or the item sets) in T, the nodes identified with dotted rectangles are the 
intersection^4 nodes or the virtual^5 nodes, and the nodes marked with ellipses 
(dotted or solid) are redundant. The nodes below the dotted line, called the large 
item set envelope, or l-envelope, are the large item sets. Notice that the node bc 
is a large item set but is not a member of T, while bcf, df and be are in T, yet 
they are not included in the set of large item sets of T. We are assuming here a 
support threshold of 25%. So, basically, we would like to compute only the nodes 
abc, bc, bf, b, d and f from T. This set is identified by the sandwich formed by 
the l-envelope and the zero-envelope, or the z-envelope, that marks the lowest 
level nodes in the lattice. If we remove the non-essential, or redundant, nodes 
from the lattice in figure 3, we are left with the lattice shown in figure 4. It is 
possible to show that the lattice shown in figure 4 is the set of non-redundant 
large item sets of T at a support threshold of 25%. The issue now is how to read 
this lattice. In other words, can we infer all the large item sets that an 
apriori-like algorithm will yield on T? The answer is yes, but in a somewhat 
different manner, as demonstrated below. 

Notice that there are five large 1-items, namely a, b, c, d and f. But only 
three, b, d and f, are listed in the lattice. The reason for not listing the other 
large 1-items is that they are implied by one of the nodes in the lattice. For 



^3 For any node u, the notations t_u and c_u mean respectively the transaction 
count and total count of u. 

^4 Nodes that share items in multiple upper level nodes and have a total count 
higher than any of the upper level nodes. 

^5 Nodes with item sets that do not appear in T, and also have a total count equal 
to all the nodes above them. 

Fig. 4. Non-redundant large item sets of T when δ_m = 0.25 

example, c is implied by bc, for which the count is 3. The nodes b and bc should 
be read as follows: b alone appears in T 5 times, whereas bc appears 3 times. 
Since c appears a maximum of 3 times with b (2 times in abc and 1 time in bcf, 
actually), its total count can be derived from bc's count. Similarly, a's count 
can be derived from that of abc, namely 2. Hence, the lattice in figure 4 does not 
include them and considers them as redundant information. This viewpoint has 
another important implication. We are now able to remove (or prune) redundant 
association rules too. We will list b → c(.38, .60) and bc → a(.29, .66), among 
several others, as two association rules that satisfy the 25% support and 40% 
confidence thresholds. Notice that we do not derive the rule b → ac(.29, .40) in 
particular. The reason is simple: it is redundant, for it can be derived from the 
rules b → c(.38, .60) and bc → a(.29, .66) using the following inference rule. 
Notice that if we accept the concept of redundancy we propose for rules, computing 
b → ac(.29, .40) does not strengthen the information content of the discovery in 
any way. 

    X → Y(δ_1, η_1)    X ∪ Y → Z(δ_2, η_2)        b → c(.38, .60)    bc → a(.29, .66)
    ---------------------------------------        -----------------------------------
           X → Y ∪ Z(δ_2, η_1 * η_2)                        b → ac(.29, .40)
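
The arithmetic behind the inference rule can be checked with exact fractions. The 
sketch below is illustrative only: it assumes the total counts of b, bc and abc 
from the count table in figure 2, and the helper name confidence is hypothetical. 
It shows that multiplying the two confidences gives the same value as computing 
the confidence of b → ac directly, because the count of bc cancels out. 

```python
from fractions import Fraction

# Total counts read off the count table (c_table) of figure 2.
TOTAL_COUNT = {
    frozenset("b"): 5,
    frozenset("bc"): 3,
    frozenset("abc"): 2,
}


def confidence(antecedent, consequent):
    """Confidence of antecedent -> consequent as a ratio of total counts."""
    whole = frozenset(antecedent) | frozenset(consequent)
    return Fraction(TOTAL_COUNT[whole], TOTAL_COUNT[frozenset(antecedent)])


eta_1 = confidence("b", "c")     # b -> c
eta_2 = confidence("bc", "a")    # bc -> a
derived = eta_1 * eta_2          # confidence the rule assigns to b -> ac
direct = confidence("b", "ac")   # the same confidence computed directly

print(eta_1, eta_2, derived, direct)
```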

Finally, we would like to point to an important observation. Take the case 
of the rules b → f(.29, .40) and f → b(.29, .66). These two rules serve as an 
important reminder that X → Y(s_1, c_1) and Y → X(s_2, c_2) ⇏ c_1 = c_2, and that 
X → Y(s_1, c_1) and Y → X(s_2, c_2) ⇒ s_1 = s_2. But in systems such as apriori, 
where all the large item sets are generated without considering redundancy, it 
would be difficult to prune rules based on this observation as we do not know 
which one to prune. For example, for the set of large item sets {bc_3, b_3, c_3}, 
we must derive the rules b → c(3/m, 1) and c → b(3/m, 1)^6 and cannot prune them 
without any additional processing. Instead, we just do not generate them at 
all. 



^6 Assuming there are m transactions. 

4 Computing Item Set Lattice Using SQL3 

Now that we have explained what non-redundant large item sets and association 
rules mean in our framework, we are ready to discuss computing them using 
SQL. The reader may recall from our discussion in the previous section that we 
have already given this problem a relational face by presenting it in terms 
of (nested) tables. We will now present a set of SQL3 statements to compute the 
tables we have discussed earlier. We must mention here that it is possible to 
evaluate the final table in figure 1 by mimicking the process using fewer 
expressions than what we present below. But we prefer to include them all 
separately for the sake of clarity. In a later section, we will discuss how this 
series of SQL statements can be replaced by an operator, the actual subject of 
this article. 

For the purpose of this discussion, we will assume that several functions that 
we are going to use in our expressions are available in some SQL3 implementation, 
such as Oracle, DB2 or Informix. Recall that the SQL3 standard requires or 
implies that, in some form or other, these functions are supported^7. In 
particular, we have used a nest by clause that functions like a group by on the 
listed attributes, but returns a nested relation as opposed to the first normal 
form relation returned by group by. We have also assumed that SQL3 can perform 
group by on nested columns (columns with set values). Finally, we have also used 
set comparators in the where clause, and set functions such as intersect and 
setminus in the select clause, which we think are natural additions to SQL3 once 
nested tuples are supported. As we have mentioned before, we have, for now, 
used user defined functions (UDFs) that treat a set of items as a string of labels 
to implement these features in Oracle. 

The following two view definitions prepare any first normal form transaction 
table for the mining process. Note that these view definitions act as idempotent 
functions on their source. So, redoing them does not harm the process if the 
source table is already in one of these forms. These two views compute the 
n_table and the f_table of figure 2. 

create view n_table as 
  (select Tranid, Items 
   from t_table 
   nest by Tranid) 

create view f_table as 
  (select Items, count(*) as Support 
   from n_table 
   group by Items) 

Before we can compute the i_table, we need to know what nodes in the 
imaginary lattice will inherit transaction counts from some of the transaction 
nodes in the lattice, i.e., the Support values of Items in the f_table. Recall 
that a node that is a subset of another node in the lattice inherits the 
transaction count of the superset node towards its total count. We also know that 
only those (non-redundant) nodes which appear in the f_table, or are in the least 
fixpoint of the nodes in f_table, will inherit them. So, we first compute the set 
of intersection nodes implied by f_table using the newly proposed SQL3 create 
view recursive statement as follows. 

^7 Although some of these functions are not supported right now, once they are, we 
will be in a better shape. Until then, we can use PL/SQL code to realize these 
functions. 

create view recursive int_table as 
  ((select distinct intersect(t.Items, p.Items), 0 
    from f_table as t, f_table as p 
    where t.Items ⊄ p.Items and p.Items ⊄ t.Items 
      and not exists 
        (select * 
         from f_table as f 
         where f.Items = intersect(t.Items, p.Items))) 
   union 
   (select distinct intersect(t.Items, p.Items), 0 
    from int_table as t, int_table as p 
    where t.Items ⊄ p.Items and p.Items ⊄ t.Items 
      and not exists 
        (select * 
         from f_table as f 
         where f.Items = intersect(t.Items, p.Items)))) 

We would like to mention here again that we have implemented this feature 
using PL/SQL in Oracle. Notice that we did not list the int_table created by this 
view in figure 1 or 2 because it is regarded as a transient table needed only for 
the computation of i_table. 
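
The recursive view can be read as a least fixpoint computation. The following 
Python sketch is one possible reading, not the PL/SQL implementation mentioned 
above: item sets are frozensets, empty intersections are dropped (an assumption), 
and the names meets, int_table and table are illustrative. The first call uses the 
frequency table of figure 2; the second uses a made-up table in which the 
recursive branch is needed to reach {a}; the third uses a table that already 
contains bc, so nothing should be generated. 

```python
from itertools import combinations


def meets(item_sets, f_table):
    """One round of the view body: intersect every pair of item sets that
    are not subsets of one another, and keep the non-empty results that
    do not already occur in f_table."""
    result = set()
    for u, v in combinations(item_sets, 2):
        if not (u <= v or v <= u):
            common = u & v
            if common and common not in f_table:
                result.add(common)
    return result


def int_table(f_table):
    """Least fixpoint: seed with the meets of f_table, then keep adding the
    meets of the intersection nodes themselves until nothing new appears."""
    nodes = meets(f_table, f_table)
    while True:
        fresh = meets(nodes, f_table) - nodes
        if not fresh:
            return nodes
        nodes |= fresh


def table(*labels):
    """Build a set of item sets from strings such as "abc"."""
    return {frozenset(label) for label in labels}


paper = int_table(table("abc", "bcf", "be", "bf", "d", "df"))
extra = int_table(table("abc", "abd", "acd"))
guarded = int_table(table("abc", "bcd", "bcf", "bc"))

for name, nodes in (("paper", paper), ("extra", extra), ("guarded", guarded)):
    print(name, sorted("".join(sorted(n)) for n in nodes))
```

Because the result is a Python set, distinctness comes for free; the membership 
test against f_table plays the role of the not exists subquery. 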

It is really important that we create only a distinct set of intersection item 
sets, and only those that do not appear in the f_table, for the sake of accuracy 
in support counting. Take for example the entries of a new frequency 
table, f'_table, represented as {abc_1, bcd_1, bcf_1, bc_1}. Assume that we compute 
the set of intersections of the entries in this table. If we do not observe the 
cautions we have mentioned, we will produce the set {bc_0, bc_0, bc_0} using the 
view expression for int_table, which is not desirable, because these three will 
inherit Support from {abc_1, bcd_1, bcf_1}, giving a total count of 10, i.e., 
bc_1^10. The correct total count should have been 4, i.e., bc_1^4. If we just 
ensure the uniqueness of a newly generated item set (but not its absence from the 
f_table) through meet computation, we still derive {bc_0} instead of an empty set, 
which is also incorrect. This means that not including the following condition in 
the above SQL expression will be a serious mistake. 

not exists (select * 
            from f_table as f 
            where f.Items = intersect(t.Items, p.Items)) 

Once we have computed the int_table, the rest of the task is pretty simple. 
The i_table view is computed by copying the Support of a tuple in f_table for any 
tuple in the collection of f_table and int_table which is a proper subset of the 
tuple in the f_table. Intuitively, these are the nodes that need to inherit the 
transaction counts of their ancestors (in f_table). 

create view i_table as 
  (select t.Items, p.Support 
   from f_table as p, 
        ((select * 
          from f_table) 
         union 
         (select * 
          from int_table)) as t 
   where t.Items ⊂ p.Items) 

From the i_table, a simple grouping and sum operation as shown below will 
give us the count table, or the c_table, of figure 2. 

create view c_table as 
  (select t.Items, sum(t.Support) as Support 
   from ((select * 
          from f_table) 
         union all 
         (select * 
          from i_table)) as t 
   group by t.Items) 

The large item sets of l_table in figure 1 can now be generated by just selecting 
on the c_table tuples as shown next. Notice that we could have combined this 
step with the c_table expression above with the help of a having clause. 

create view l_table as 
  (select Items, Support 
   from c_table 
   where Support ≥ δ_m) 

Finally, the (non-redundant) association rules of figure 1 are computed using 
the r_table view below. The functionality of this view can be explained 
as follows. Two item sets u[Items] and v[Items] in a pair of tuples u and v 
in the l_table imply an association rule of the form 
u[Items] → v[Items] \ u[Items] (v[Support], v[Support] / u[Support]), only if 
u[Items] ⊂ v[Items] and there does not exist any intervening item set x in the 
l_table such that x is a superset of u[Items] and is a subset of v[Items] as well. 
In other words, in the lattice, v[Items] is one of the immediate ancestors of 
u[Items]. In addition, the ratio of the Supports, i.e., v[Support] / u[Support], 
must be at least equal to the minimum confidence threshold η_m. 

create view r_table as 
  (select a.Items, c.Items \ a.Items, c.Support, c.Support / a.Support 
   from l_table as a, l_table as c 
   where a.Items ⊂ c.Items and c.Support / a.Support ≥ η_m and not exists 
     (select Items 
      from l_table as i 
      where a.Items ⊂ i.Items and i.Items ⊂ c.Items)) 
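
As a cross-check of the last two views, the sketch below applies the same 
selection conditions in Python. It is illustrative only: the total counts are 
copied from the count table of figure 2, thresholds are exact fractions, item 
sets are frozensets, and the names large_item_sets and rules are hypothetical. 
Supports are kept as counts out of 7 transactions. 

```python
from fractions import Fraction

# Total counts from the count table (c_table) of figure 2; 7 transactions.
C_TABLE = {
    "abc": 2, "bcf": 1, "bc": 3, "bf": 2, "be": 1,
    "df": 1, "b": 5, "f": 3, "d": 2,
}
TRANSACTIONS = 7


def large_item_sets(c_table, min_support):
    """l_table: item sets whose relative support reaches the threshold."""
    return {
        frozenset(items): count
        for items, count in c_table.items()
        if Fraction(count, TRANSACTIONS) >= min_support
    }


def rules(l_table, min_confidence):
    """r_table: one rule per pair (a, c) with a a proper subset of c, no
    large item set strictly between them, and enough confidence."""
    found = []
    for a, a_count in l_table.items():
        for c, c_count in l_table.items():
            if not a < c:
                continue
            if any(a < i < c for i in l_table):
                continue  # c is not an immediate ancestor of a
            confidence = Fraction(c_count, a_count)
            if confidence >= min_confidence:
                found.append((a, c - a, c_count, confidence))
    return found


l_table = large_item_sets(C_TABLE, Fraction(25, 100))
r_table = rules(l_table, Fraction(40, 100))
for ant, cons, count, confidence in r_table:
    print("".join(sorted(ant)), "->", "".join(sorted(cons)), count, confidence)
```

Using Fraction rather than floating point keeps the boundary case, a confidence 
of exactly 2/5 against a threshold of 0.40, free of rounding surprises. 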

The readers may verify that these are the only “generic” SQL3 expressions 
(or their equivalent) that are needed to mine any relational database (assuming 
proper name adaptations for tables and columns). The essence of this relational 
interpretation of the problem of mining, as demonstrated by the SQL3 expressions 
above, is that we do not need to think in terms of iterations, candidate 
generation, space and time overhead, and so on. Instead, we can now express our 
mining problems on any relational database in declarative ways, leave the 
optimization issues to the system, and let the system process the query using 
the best method available to it, recognizing the fact that, depending on the 
instance of the database, the choice of the best method may vary widely. 



5 An Enabling Observation 

Level-wise algorithms such as apriori essentially have three distinct steps at each 
pass k: (i) scan the database and count length k candidate item sets against the 
database, (ii) test and discard the ones that are not large item sets, and (iii) 
generate candidate item sets of length k + 1 from the length k large item sets 
just generated and continue to the next iteration level. The purpose of the second 
step is to prune potential candidates that are not going to generate any large 
item sets. This heuristic is called the anti-monotonicity property of large item 
sets. While this heuristic saves an enormous amount of space and time in large 
item set computation and virtually makes association rule mining feasible, 
further improvements are possible. We make the observation that apriori fails to 
remove redundant large item sets that really do not contribute anything new, 
and that not generating the redundant large item sets does not cause any adverse 
effect on the discovery of the set of association rules implied by the database. 
In other words, apriori fails to recognize another very important optimization 
opportunity. Perhaps the most significant and striking contribution of this new 
optimization opportunity is its side effect on the declarative computation of 
association rules using languages such as SQL, which is the subject of this 
article. This observation helps us avoid thinking level-wise and allows us to 
break free from the expensive idea of candidate generation and testing even in an 
SQL-like setup such as in [25,21,20]. As mentioned earlier, methods such as 
[4,27] have already achieved significant performance improvements over apriori by 
not requiring the generation of candidate item sets. 

To explain the idea on intuitive grounds, let us consider the simple transaction 
table t_table in figure 1. Apriori will produce the database in 
figure 5(a) in three iterations when the given support threshold is ≈ 25%, or 2 
out of 7 transactions. 




Fig. 5. (a) Frequent item set table generated by apriori and other major 
algorithms such as FP-tree. Candidate sets generated by a very smart apriori at 
iterations (b) k = 1, (c) k = 2, and (d) k = 3 

Although apriori generates the table in figure 5(a), it needs to generate the 
candidate sets in figures 5(b) through 5(d) to test. Notice that although the 
candidates {a, d}, {a, f}, {b, d}, {c, d}, {c, f}, {d, f}, {a, b, f} and {b, c, f} 
were generated (figures 5(c) and 5(d)), they did not meet the minimum support 
threshold and were never included in the frequent item set table in figure 5(a). 
Also notice that these are the candidate item sets generated by a very smart 
apriori algorithm that scans the large item sets generated at pass k in order to 
guess the possible set of large item sets it could find in pass k + 1, and selects 
the guessed sets as the candidate item sets. A naive algorithm, on the other hand, 
would exhaustively generate all the candidate sets from the large item sets 
generated at pass k by assuming that they are all possible. Depending on the 
instances, both have advantages and disadvantages. But no matter which technique 
is used, apriori must generate a set of candidates, store them in some structure 
to be able to access them conveniently, and check them against the transaction 
table to see if they become large item sets at the next iteration step. Even by 
conservatively generating a set of candidate item sets as shown in figures 5(b) 
through 5(d) for the database in figure 1 (t_table), it wastes time and space on 
the candidate sets that never make it to the large item sets. Depending on the 
transaction database, the wastage could be significant. The question now is, 
could we generate candidate sets that have a better chance of becoming large item 
sets? In other words, could we generate the absolute minimum set of candidates 
that are surely large item sets? In some way, we think the answer is in the 
positive, as we explain in the next section. 
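
To make the candidate overhead concrete, here is a small level-wise sketch in 
Python. It is not the apriori implementation behind figure 5: candidates for pass 
k are formed simply by joining pairs of large (k - 1)-item sets, an assumption 
chosen because it reproduces the candidates listed in the text, and the names 
support_count, next_candidates and apriori are illustrative. The transactions are 
those of the nested table in figure 2, with a threshold of 2 out of 7. 

```python
# Nested transactions of figure 2 (n_table).
TRANSACTIONS = [set("abc"), set("bcf"), set("bf"), set("abc"),
                set("be"), set("df"), set("d")]
MIN_COUNT = 2  # roughly 25% of 7 transactions


def support_count(candidate):
    """Scan the database and count the transactions containing candidate."""
    return sum(1 for t in TRANSACTIONS if candidate <= t)


def next_candidates(large, k):
    """Join large (k-1)-item sets pairwise into candidate k-item sets."""
    return {u | v for u in large for v in large if len(u | v) == k}


def apriori():
    """Return the large item sets with counts, and the rejected candidates."""
    items = {item for t in TRANSACTIONS for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent, rejected, k = {}, set(), 1
    while candidates:
        large = set()
        for c in candidates:
            n = support_count(c)
            if n >= MIN_COUNT:
                frequent[c] = n
                large.add(c)
            else:
                rejected.add(c)
        k += 1
        candidates = next_candidates(large, k)
    return frequent, rejected


frequent, rejected = apriori()
print(len(frequent), "large item sets,", len(rejected), "rejected candidates")
print(sorted("".join(sorted(c)) for c in rejected))
```

On this table the rejected candidates are the eight item sets named above plus 
the single item e, each of which cost a scan of the transactions. 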



5.1 Implications 

We take the position and claim that the table l_table shown in figure 1 is 
an information equivalent table of figure 5(a), which essentially means that 
these tables faithfully imply one another (assuming identical support thresholds, 
≈ 25%). Let us now examine what this means in the context of association rule 
mining on intuitive grounds. 

First of all, notice that the tuples (i.e., large item sets) missing from the 
table of figure 1 are listed in the table of figure 5(e), i.e., the union of these 
two tables gives us the table in figure 5(a). Now the question is, why do we 
separate them, and why do we deem the tuples in figure 5(e) redundant? Before we 
present the reasoning, we would like to define the notion of redundancy of large 
item sets in order to keep our discussion in perspective. 

Definition 5.1. Let T be a transaction table over item sets X, I ⊆ X be an item 
set, and n be a positive integer. Also let n represent the frequency with which 
the item set I appears in T. Then the pair (I, n) is called a frequent item set, 
and the pair is called a large item set if n ≥ δ_m, where δ_m is the minimum 
support threshold. 

We define redundancy of large item sets as follows. If for any large item 
set I, its frequency n can be determined from other large item sets, then I is 
redundant. Formally, 

Definition 5.2 (Redundancy of Large Item Sets). Let L be a set of large 
item sets consisting of tuples of the form (I_u, n_u) such that 
∀x, y (x = (I_x, n_x), y = (I_y, n_y) ∈ L ∧ I_x = I_y ⇒ n_x = n_y), and let 
u = (I_u, n_u) be such a tuple. Then u is redundant in L if 
∃v (v ∈ L, v = (I_v, n_v), I_u ⊂ I_v ∧ n_u = n_v). 

The importance of definition 5.2 may be highlighted as follows. For any 
given set of large item sets L, and an element l = (I_l, n_l) ∈ L, I_l is unique 
in L. The implication of anti-monotonicity is that for any other 
v = (I_v, n_v) ∈ L such that I_l ⊂ I_v holds, n_v ≤ n_l, because an item set 
cannot appear in a transaction database fewer times than any of its supersets. 
But the important case is when n_v = n_l yet I_l ⊂ I_v. This implies that I_l 
never appears in a transaction alone, i.e., it always appears with other items. 
It also implies that for all large item sets s = (I_s, n_s) ∈ L of I_v such that 
I_s ⊃ I_l, if any exist, n_v = n_s too. If not, n_l would be different from n_v, 
which it is not, according to our assumption. It also implies that I_l is not 
involved in any sub-superset relationship chain other than that of I_v. There are 
several other formal and interesting properties that the large item sets satisfy, 
some of which we will present in a later section. 

The equality of frequency counts of large item sets that are related via 
sub-superset relationships is significant. This observation offers 
us another opportunity to optimize the computation of large item sets. 
It tells us that there is no need to compute the large item sets for which there 
exists another large item set which is a superset and has an identical frequency 
count. For example, for an item set S = {a, b, c, d, e, f, g, h}, apriori will 
iterate eight times if S is a large item set and generate |P(S)| subsets of S with 
identical frequency counts when, say, S is the only distinct item set in a 
transaction table. A hypothetical smart algorithm armed with definition 5.2 will 
only iterate once and stop. Now, if needed, the other large item sets computed by 
apriori can be 







computed from S just by generating all possible subsets of S and copying the 
frequency count of S. If S is not a large item set, then neither is any subset 
of S. Apriori will discover this during the first iteration and stop only if S is 
not a large item set, and so will the smart algorithm. 

Going back to our example table in figure 5(a), and its equivalent table l_table 
in figure 1, using definition 5.2 we can easily conclude that the large item 
sets in figure 5(e) are redundant. For example, the item set {a} is a subset of 
{a, b, c} and both have a frequency or support count of 2. This implies that there 
exist no other transactions that contribute to the count of a. Hence, it 
is redundant. On the other hand, {b} is not redundant because the conditions of 
definition 5.2 do not apply. And indeed we can see that {b} is a required and 
non-redundant large item set because if we delete it, we cannot hope to infer 
its frequency count from any other large item set in the l_table of figure 1. 
Similar arguments hold for the other tuples in the l_table of figure 1. 
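
The redundancy test of definition 5.2 is easy to state in code. The sketch below 
is illustrative only: the ten large item sets and their counts are derived by 
counting over t_table at a threshold of 2 out of 7, and the function names 
non_redundant and expand are hypothetical. The expand step, which gives each 
dropped item set the largest count among the kept item sets containing it, is 
shown to hold for this example rather than claimed in general. 

```python
# Large item sets with their support counts, as a level-wise algorithm
# would report them for t_table at a threshold of 2 out of 7 transactions.
ALL_LARGE = {
    frozenset("a"): 2, frozenset("b"): 5, frozenset("c"): 3,
    frozenset("d"): 2, frozenset("f"): 3,
    frozenset("ab"): 2, frozenset("ac"): 2, frozenset("bc"): 3,
    frozenset("bf"): 2, frozenset("abc"): 2,
}


def non_redundant(large):
    """Drop every item set that has a proper superset with the same count."""
    return {
        items: n for items, n in large.items()
        if not any(items < other and n == m for other, m in large.items())
    }


def expand(kept):
    """Recover the dropped item sets: every non-empty subset of a kept item
    set gets the largest count among the kept item sets containing it."""
    full = {}
    for items in kept:
        members = sorted(items)
        for mask in range(1, 2 ** len(members)):
            subset = frozenset(m for i, m in enumerate(members) if mask >> i & 1)
            full[subset] = max(n for other, n in kept.items() if subset <= other)
    return full


l_table = non_redundant(ALL_LARGE)
print(sorted("".join(sorted(s)) for s in l_table))
print(expand(l_table) == ALL_LARGE)
```

The first function keeps exactly the six item sets of the l_table in figure 1, 
and the second shows that the four dropped ones carry no extra information. 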

The non-redundant large item set table in figure 1 unearths two striking 
and significant facts. All the item sets that are found to be large either 
appear as a transaction in the t_table in figure 1, e.g., {b, f} and {a, b, c}, or 
are intersections of two or more item sets of the source table not related via 
sub-superset relationships, e.g., {b}, which is an intersection of {b, e} and 
{b, c, f}, where {b, e} ⊄ {b, c, f} and {b, e} ⊅ {b, c, f}. 

We would like to point out here that, depending on the database instance, it is 
possible that apriori will generate an optimal set of candidate sets and no 
further optimization is possible, because in that situation all the candidate 
sets that were generated would contribute towards other large item sets and hence 
were required. This implies that the candidate sets were themselves large item 
sets by way of the anti-monotonicity property of large item sets. This will happen 
if there are no redundant large item sets. But the issue here is that when there 
are redundant large item sets, apriori will fail to recognize that. In fact, 
FP-tree [4] and CHARM [27] gain a performance advantage over apriori when there 
are long chains and low support thresholds due to this fact. Apriori must generate 
the redundant set of large item sets to actually compute the non-redundant ones, 
while the others need not. 

This observation is important because it sends the following messages: 

• Only the item sets that appear in a source table, or their meets with any 
other item set in the source table, can ever be large item sets. 

• There is no need to consider any item set that is neither in the source table 
nor obtainable from it by computing the least fixpoint of the meets of the 
source item sets, as all others are invariably redundant, even if they are 
large item sets. 

• The support count for any large item set can be obtained by adding the 
frequency counts of its ancestors (superset item sets) in the source table 
to its own frequency count. 

• No item set in the item set lattice other than the source item sets 
(transaction nodes/records) will ever contribute to the support count of any 
item set. 

These observations readily suggest the approach we adopted here in developing 
a relational solution to the mining problem. All we needed was to apply a 
least fixpoint computation to the source item sets to find their meets. Then we 
applied the idea of inheritance of frequency counts of ancestors (source item sets) 
to other source item sets as well as to the newly generated meet item sets. It is 
evident that the least fixpoint of the meets needs to be computed only for the 
item sets that are not related by a subset-superset relationship in the item set 
lattice. 

6 An SQL3 Mining Operator 

We are now ready to discuss our proposal for a mining operator for SQL3. We 
already know that the (non-redundant) large item sets and the (non-redundant) 
rules can be computed using SQL3, for which we have discussed a series of 
examples and expressions. We also know from our previous discussion that the 
method we have adopted is sound. We further know that the set of rules computed 
by our method is identical to the set computed by non-redundant apriori, 
or is equivalent to the rules computed by naive apriori. So, it is perfectly all 
right to abstract the idea into an operator for generic use. 

The mine by operator shown below will generate the l_table in figure 1. 
Basically, its semantics translates to the set of view definitions (or their 
equivalents) for n_table, f_table, int_table, i_table, c_table and l_table. 
However, only the l_table view is returned to the user as a response to the mining 
query, and all the other tables remain hidden (used by the system and discarded). 
Notice that we have supplied two column names to the mine by operator: Tranid and 
Items. The Tranid column name instructs the system that the nesting should be done 
on this column, and thus the support count comes from the count of Tranids for any 
given set of Items. The Items column name suggests that the equivalent of the 
l_table shown in figure 1 should be constructed for the Items column. Essentially, 
this mine by expression will produce the l_table of figure 1 once we set 
δ_m = 0.25. 

select Items, sup(Tranid) as Support 
from t_table 
mine by Tranid for Items 
having sup(Tranid) ≥ δ_m 

We have also used a having clause for the mine by operator in a way similar to 
the having clause of the SQL group by operator. It uses a function called sup. 
This function, for every tuple in the c_table, generates the ratio of the Support 
to the total number of distinct transactions in the t_table. Consequently, the 
having option with the condition as shown filters out the unwanted tuples, 
leaving the large item sets. The select clause allows only a subset of the column 
names listed in the mine by clause along with any aggregate/mine operations on 
them. In this case, we are computing support for every item set using the sup 
function just discussed. 

For the purpose of generating the association rules, we propose the so-called 
extract rules using operator. This operator requires a list of column names, for 
example Items, using which it derives the rules. Basically, the expression below 
produces the r_table of figure 1 for η_m = 0.40. Notice that we have used a having 
clause and a mine function called conf that computes the confidence of the rule. 
Recall that the confidence of a rule can be computed from the support values in 
the l_table: it is the (appropriately taken) ratio of the two supports. 

select ant(t.Items) as Ant, cons(t.Items) as Conseq, t.Support, conf(t.Support) 
from (select Items, sup(Tranid) as Support 
      from t_table 
      mine by Tranid for Items 
      having sup(Tranid) ≥ δ_m) as t 
extract rules using t.Items on t.Support 
having conf(t.Support) ≥ η_m 

Notice that this query is equivalent to the view definition for r_table in section 
4. Consequently, here is what this syntax entails. The extract rules using clause 
forces a Cartesian product of the relation (or the list of relations) named in the 
from clause. Naturally, the two attribute names mentioned in the extract rules 
using clause will have two copies in two columns. As explained as part of the 
r_table view discussion in section 4, from these four attributes all the necessary 
attributes of r_table can be computed without any confusion, even though we 
mention only two of the four attributes (see the r_table view definition). All 
this clause needs to know is which two attributes it must use from the relation 
in the from clause, and which one of them has the support values. The rest is 
trivial. 

The mine functions ant and cons generate the antecedent and consequent 
of a rule from the source column included as the argument. Recall that rule 
extraction is done on a source relation by pairing its tuples (Cartesian product) 
and checking for the conditions of a valid rule. It must be mentioned here that 
the ant and cons functions can also be used in the having clause. For example, if 
we were interested in finding all the rules for which the consequent is ab, we 
would then rewrite the above query as follows: 

select ant(t.Items) as Ant, cons(t.Items) as Conseq, t.Support, conf(t.Support) 
from (select Items, sup(Tranid) as Support 
      from t_table 
      mine by Tranid for Items 
      having sup(Tranid) ≥ δ_m) as t 
extract rules using t.Items on t.Support 
having conf(t.Support) ≥ η_m and cons(t.Items) = {a, b} 

It is, however, possible to adopt a single semantics for the mine by operator. To 
this end we propose a variant, called the mine with operator, that is syntactically 
distinct from mine by. In this approach, the mine with operator computes the 
rules directly and does not produce the intermediate large item set table l_table. 
In this case, however, we need to change the syntax a bit as shown below. Notice 
that the change is essentially in the argument of the conf function. Previously we 
used the Support column of the l_table, but now we use the Tranid column instead. 
This choice makes sense because support is computed from this column, and the 
support column is still hidden inside the process and not yet known, which was 
not the case for the extract rules using operator. In that case we knew which 
column to use as the argument of the conf function. Moreover, the Tranid column 
was not even available in the l_table, as it was not needed. 

select ant(Items) as Ant, cons(Items) as Conseq, 
       sup(Tranid) as Support, conf(Support) as Confidence 
from t_table 
mine with Tranid for Items 
having sup(Tranid) > δ_m and conf(Tranid) > η_m 

Here too, one can use any of the mine functions in the having clause to filter 
unwanted derivations. We believe that this modular syntax and customizable 
semantics bring strength and agility to our system. 

We would like to point out that while both approaches are appealing, depending 
on the situation, we prefer the first approach, which breaks the process into 
two steps. The first approach may make it possible to use large item sets for 
other kinds of computations that have not been identified yet. Conversely, 
the single semantics approach makes it difficult to construct the large item sets 
for any sort of analysis, which we believe has applications in other systems of 
rule mining. 

7 Optimization Issues 

While it was intellectually challenging and satisfying to develop a declarative 
expression for association rule mining from relational databases using only ex- 
isting (or standard) object relational machinery, we did not address the issue 
related to query optimization in this article. We address this issue in a separate 
article [6] for the want of space and also because it falls outside the scope of 
the current article. We would like to point out here that several non-trivial opti- 
mization opportunities exist for our mining operator and set value based queries 
we have exploited. Fortunately though, there has been a vast body of research 
in optimizing relational databases, and hence, the new questions and research 
challenges that this proposal raises for declarative mining may exploit some of 
these advances. 

There are several open issues with some hopes for resolution. In the worst 
case, the least fixpoint needs to generate tuples in the first pass alone when 
the database size is n - which is quite high. Theoretically, this can happen only 
when each transaction in the database produces an intersection node, and when 
they are not related by subset-superset relationship. In the second pass, we need 
to do computations, and so on. The question now is, can we avoid generating, 
and perhaps scanning, some of these combinations as they will not lead to useful 
intersections? For example, the node Cg in figure 3 is redundant. In other words, 
can we only generate the nodes within the sandwich and never generate any node 
that we would not need? A significant difference with apriori like systems is that 
our system generates all the item sets top down (in the lattice) without taking 
their candidacy as a large item set into consideration. Apriori, on the other hand, 
does not generate a node unless all of its subsets are large item sets themselves, and 
thereby prunes a large set of nodes. Optimization techniques that exploit this 
so-called “anti-monotonicity” property of item set lattices, as apriori does, could 
make all the difference in our setup. The key issue would be how we push the 
selection threshold (minimum support) inside the top down computation of the 
nodes in the lattice in our method. Technically, we should be able to combine 
view int_table with i_table, c_table and l_table and somehow not generate a 
virtual node that is outside the sandwich (below the z-envelope in figure 3). 
This will require pushing the selection condition inside aggregate operations where 
the condition involves the aggregate operation itself. 
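
For reference, the anti-monotonicity pruning that apriori-like systems rely on can be sketched as below. This is the bottom-up, level-wise scheme that the text contrasts with our top-down computation, not the method of this article; the function name apriori, the sample transactions and the absolute threshold min_count are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise search that prunes candidates having an infrequent subset."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        # Count support only for the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
        # Join step: unions of frequent sets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (anti-monotonicity): every (k-1)-subset must be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent

# Hypothetical transactions, for illustration only.
transactions = [frozenset(t) for t in ("abc", "ab", "ac", "bd")]
frequent = apriori(transactions, min_count=2)
```

In this run the candidate {a, b, c} is never counted, because its subset {b, c} is not frequent. That is the kind of saving the top-down computation would need to obtain by pushing the support threshold into the lattice traversal.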

For the present, and for the sake of this discussion, let us consider a higher 
support threshold of 45% (3 out of 7 transactions) for the database T of figure 1. 
Now the l-envelope will move even closer to the z-envelope and the 
nodes bff and dl will be outside this sandwich. This raises the question: is it 
possible to utilize the support and confidence thresholds provided in the query 
to prune candidates for intersection any further? Ideas similar to magic sets 
transformation [2,26] and relational magic sets [15] may be borrowed to address 
this issue. The only problem is that pruning of any node depends on its support 
count, which may come at a later stage. By then all nodes may already have 
been computed. Specifically, pushing selection conditions inside the aggregate 
operator may become challenging. Special data structures and indexes may perhaps 
aid in developing faster methods to compute the intersection joins that we 
have utilized in this article. We leave these questions as open issues that should 
be taken up in the future. 

Needless to say, a declarative method, preferably a formal one, is desirable: 
once we understand the functioning of the system and establish the equivalence 
of its declarative and procedural semantics, we can select, depending on the 
database instance, an appropriate procedure to compute the relational queries 
involving mining operators, knowing that it computes what is intended. 
Fortunately, we have numerous procedural methods for 
computing association rules which complement each other in terms of speed and 
database instances. In fact, that is what declarative systems (or declarativity) 
buy us: a choice of the most efficient and accurate processing possible. 

8 Comparison with Related Research 

We would like to end our discussion by highlighting some of the contrasts between 
our proposal and the proposals in [25,21,13,14]. It is possible to compute the rules 
using the mine rule operator of [13,14] as follows. 

mine rule simpleassociation as 
select distinct 1..n Items as body, 1..n Items as head, support, confidence 
from t_table 
group by Tranid 
extracting rules with support: δ_m, confidence: η_m 
As we mentioned before, this expression is fine, but may be considered rigid 
and does not offer the flexibilities offered by our syntax. But the main difference 
is in its implementation which we have already highlighted in section 1.1. 

It is relatively difficult to compare our work with the set of works in [25,21,20] 
as the focus there is somewhat different, we believe. Our goal has been to define a 
relational operator for rule mining and develop the formal basis of the operator. 
Theirs, we believe, was to develop implementation strategies of apriori in SQL 
and present some performance metrics. As long as the process is carried out 
by a system, it is not too difficult to develop a front end operator that can be 
implemented using their technique. Even then, the only comparison point with 
our method would be the execution efficiency. In fact, it might become possible 
to implement our operator using their technique. 



9 Conclusions 

It was our goal to demonstrate that association rules can be computed using 
existing SQL3 machinery, which we believe we have done successfully. We have, of 
course, used some built-in functions for set operations that current SQL systems 
may not support, but we believe that future enhancements of SQL will. 
These functions can be easily implemented using SQL’s create function statements, 
as we have done. We have utilized SQL’s create view recursive clause to 
generate the intersection nodes, which was implemented in PL/SQL. 

If one compares the SQL3 expressions presented in this article with the series 
of SQL expressions presented in any of the works in [25,21,13,14] that involve 
multiple new operators and update expressions, the simplicity and strength of 
our least fixpoint based computation will be apparent. Hence, we believe that 
the idea proposed in this article is novel because, to our knowledge, association 
rule mining using standard SQL/SQL3 is unprecedented. By that, we mean 
SQL without any extended set of operators such as combination and GatherJoin 
[25,21], or the CountAllGroups, MakeClusterPairs, ExtractBodies, and ExtractRules 
operators in [13,14]. 

Our mine by operator should not be confused with the set of operators in 
[25,21] and [13,14]. These operators are essential for their framework to function 
whereas the mine by operator in our framework is not necessary for our mining 
queries to be functional. It is merely an abstraction (and a convenience) of the 
series of views we need to compute for association rule mining. The method 
proposed is soundly grounded on formal treatment of the concepts, and its cor- 
rectness may be established easily. We did not attempt to prove the correctness 
for the sake of conciseness and for want of space, but we hope that readers may 
have already observed that these are just a matter of details, and are somewhat 
intuitive too. 

The mine by operator proposed here is simple and modular. The flexibilities 
offered by it can potentially be exploited in real applications in many ways. The 
operators proposed can be immediately implemented using the existing object 
relational technology and exploit existing optimization techniques by simply 
mapping the queries containing the operators to equivalent view definitions as 
discussed in section 4. These are significant in terms of viability of our proposed 
framework. 

As a future extension of the current research, we are developing an efficient 
algorithm for top-down procedural computation of the non-redundant large item 
sets and an improved SQL3 expression for computing such a set. We believe that 
a new technique for computing item set join (join based on subset condition as 
shown in the view definition for int_table) based on set indexing would be useful 
and efficient. In this connection, we are also looking into query optimization 
issues in our framework. 
Abstract. We present a logic database language with elementary data 
mining mechanisms to model the relevant aspects of knowledge discovery, 
and to provide a support for both the iterative and interactive features 
of the knowledge discovery process. We adopt the notion of user-defined 
aggregate to model typical data mining tasks as operations unveiling un- 
seen knowledge. We illustrate the use of aggregates to model specific data 
mining tasks, such as frequent pattern discovery, classification, data dis- 
cretization and clustering, and show how the resulting data mining query 
language allows the modeling of typical steps of the knowledge discovery 
process, that range from data preparation to knowledge extraction and 
evaluation. 



1 Introduction and Motivations 

Research in data mining and knowledge discovery in databases has mostly con- 
centrated on algorithmic issues, assuming a naive model of interaction in which 
data is first extracted from a database and transformed in a suitable format, and 
next processed by a specialized inductive engine. Such an approach has the main 
drawback of proposing a fixed paradigm of interaction. Although it may at first 
sound appealing to have an autonomous data mining system, it is practically 
infeasible to let the data mining algorithm “run loose” in the data in the hope of 
finding some valuable knowledge. Blind search in a database can easily lead to 
the discovery of an overwhelmingly large set of patterns, many of which could be 
irrelevant, difficult to understand, or simply not valid: in one word, uninteresting. 

On the other hand, current applications of data mining techniques highlight 
the need for flexible knowledge discovery systems, capable of supporting the user 
in specifying and refining mining objectives, combining multiple strategies, and 
defining the quality of the extracted knowledge. A key issue is the definition of 
Knowledge Discovery Support Environment [16], i.e., a query system capable of 
obtaining, maintaining, representing and using high level knowledge in a unified 
framework. This comprises representation of domain knowledge, extraction of 
new knowledge and its organization in ontologies. In this respect, a knowledge 
discovery support environment should be an integrated mining and querying 
system capable of 

— rigorous definition of user interaction during the search process, 

— separation of concerns between the specification and the mapping to the 
underlying databases and data mining tools, and 

— understandable representations for the knowledge. 

Such an environment is expected to increase the programmer productivity 
of KDD applications. However, such capabilities require higher-order expressive 
features capable of providing a tight-coupling between knowledge mining and 
the exploitation of domain knowledge to share the mining itself. 

A suitable approach can be the definition of a set of data mining primitives, 
i.e., a small number of constructs capable of supporting a vast majority of KDD 
applications. The main idea (outlined in [13]) is to combine relational query lan- 
guages with data mining primitives in an overall framework capable of specifying 
data mining problems as complex queries involving KDD objects (rules, cluster- 
ing, classifiers, or simply tuples). In this way, the mined KDD objects become 
available for further querying. The principle that query answers can be queried 
further is typically referred to as closure, and is an essential feature of SQL. 
KDD queries can thus generate new knowledge or retrieve previously generated 
knowledge. This allows for interactive data mining sessions, where users cross 
boundaries between mining and querying. Query optimization and execution 
techniques in such a query system will typically rely on advanced data mining 
algorithms. 

Today, it is still an open question how to realize such features. Recently, the 
problem of defining a suitable knowledge discovery query formalism has interested 
many researchers, both from the database perspective [1,2,13,19,14,11] and 
the logic programming perspective [21,22,20]. The approaches devised, however, 
do not explicitly model in a uniform way features such as closure, knowledge 
extraction and representation, background knowledge and interestingness measures. 
Rather, they are often presented as “ad-hoc” proposals, particularly suitable only 
for subsets of the described peculiarities. 

In such a context, the idea of integrating data mining algorithms in a deductive 
environment is very powerful, since it allows the direct exploitation of 
domain knowledge within the specification of the queries, the specification of 
ad-hoc interest measures that can help in evaluating the extracted knowledge, and 
the modeling of the interactive and iterative features of knowledge discovery 
in a uniform way. In [5,7] we propose two specific models based on the notion 
of user-defined aggregate, namely for association rules and Bayesian classification. 
In these approaches we adopt aggregates as an interface to mining tasks 
in a deductive database, obtaining a powerful amalgamation between inferred 
and induced knowledge. Moreover, in [4,8] we show that efficiency issues can be 
tackled effectively by providing a suitable specification of aggregates. 

In this paper we generalize the above approaches. We present a logic database 
language with elementary data mining mechanisms to model the relevant aspects 
of knowledge discovery, and to provide a support for both the iterative and 
interactive features of the knowledge discovery process. We shall refer to the 
notion of inductive database introduced in [17,18,1], and provide a logic database 
system capable of representing inductive theories. The resulting data mining 
query language shall incorporate all the relevant features that allow the modeling 
of typical steps of the knowledge discovery process. 

The paper is organized as follows. Section 2 provides an introduction to user-defined 
aggregates. Section 3 introduces the formal model that allows us to incorporate 
both induced and deduced knowledge in a uniform framework. In section 4 
we instantiate the model to some relevant mining tasks, namely frequent pattern 
discovery, classification, clustering and discretization, and show how the 
resulting framework allows the modeling of the interactive and iterative features 
of the knowledge discovery process. Finally, section 5 discusses how efficiency 
issues can be profitably addressed, in the style of [8]. 

2 User-Defined Aggregates 

In this paper we refer to datalog++ and its current implementation LDL++, 
a highly expressive language which includes among its features recursion and a 
powerful form of stratified negation [23], and is viable for efficient query 
evaluation [6]. 

A remarkable capability of such a language is that of expressing distributive 
aggregates, such as sum or count. For example, the following clause illustrates 
the use of the count aggregate: 

p(X, count(Y)) ← r(X, Y). 

It is worth noting the semantic equivalence of the above clause to the following 
SQL statement: 

SELECT X, COUNT(Y) 
FROM r 
GROUP BY X 

Specifying User-defined aggregates. Clauses with aggregation are possible mainly 
because datalog++ supports nondeterminism [9] and XY-stratification [6,23]. 
This allows the definition of distributive aggregate functions, i.e., aggregate 
functions defined in an inductive way ($S$ denotes a set and $h$ is a composition 
operator): 

Base: $f(\{x\}) := g(x)$ 

Induction: $f(S \cup \{x\}) := h(f(S), x)$ 

Users can define aggregates in datalog++ by means of the predicates single 
and multi. For example, the count aggregate is defined by the unit clauses: 

single(count, X, 1). 
multi(count, X, C, C + 1). 
The first clause specifies that the count of a set with a single element $x$ is 1. The 
second clause specifies that the count of $S \cup \{x\}$ is $c + 1$ for any set $S$ such that 
the count of $S$ is $c$. 

Intuitively, the single predicate computes the Base step, while the multi 
predicate computes the Inductive step. The evaluation of clauses containing 
aggregates consists mainly in (i) evaluating the body of the clause and 
nondeterministically sorting the tuples resulting from the evaluation, and (ii) evaluating 
the single predicate on the first tuple and the multi predicate on the other 
tuples. 

The freturn and ereturn predicates [25] allow building complex aggregate 
functions from simpler ones. For example, the average of a set $S$ is obtained 
by dividing the sum of its elements by the number of elements. If $S$ contains 
$c$ elements whose sum is $s$, then $S \cup \{x\}$ contains $c + 1$ elements whose sum is 
$s + x$. This leads to the following definition of the avg function: 

single(avg, X, (X, 1)). 
multi(avg, X, (S, C), (S + X, C + 1)). 
freturn(avg, (S, C), S/C). 
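
The single/multi/freturn scheme above amounts to a fold over the tuples, which the following Python sketch makes explicit. It is an illustration only: the evaluate function and the encoding of each aggregate as a triple of Python functions are assumptions, and the identity freturn given to count is added here for uniformity (the clauses above define none for count).

```python
def evaluate(aggregate, values):
    """Fold `values` with the single/multi/freturn rules of `aggregate`."""
    single, multi, freturn = aggregate
    it = iter(values)
    state = single(next(it))        # Base step on the first tuple
    for x in it:
        state = multi(x, state)     # Inductive step on the remaining tuples
    return freturn(state)

# count: single(count, X, 1).  multi(count, X, C, C + 1).
count = (lambda x: 1,
         lambda x, c: c + 1,
         lambda c: c)               # identity freturn (assumed)

# avg: single(avg, X, (X, 1)).  multi(avg, X, (S, C), (S + X, C + 1)).
#      freturn(avg, (S, C), S/C).
avg = (lambda x: (x, 1),
       lambda x, sc: (sc[0] + x, sc[1] + 1),
       lambda sc: sc[0] / sc[1])

n = evaluate(count, [4, 8, 6])
mean = evaluate(avg, [4, 8, 6])
```

Both aggregates need a single scan, which is what makes them distributive in the sense used above.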

Iterative User-Defined Aggregates. Distributive aggregates are easy to define by 
means of the user-defined predicates single and multi in that they simply require 
a single scan of the available data. In many cases, however, even simple 
aggregates require multiple scans of the data. As an example, the absolute deviation 
$S_n = \sum_{i=1}^{n} |x_i - \bar{x}|$ of a set of $n$ elements is defined as the sum of the absolute 
difference of each element with the average value $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ of the set. In 
order to compute such an aggregate, we need to scan the available data twice: 
first, to compute the average, and second, to compute the sum of the absolute 
difference. 

However, the datalog++ language is powerful enough to cope with multiple 
scans in the evaluation of user-defined aggregates. Indeed, in [4] we extend the 
semantics of datalog++ by introducing the iterate predicate, which can be 
exploited to impose some user-defined conditions for iterating the scans over the 
data. More specifically, the evaluation of a clause 

p(X, aggr(Y)) ← r(X, Y). (1) 

(where aggr denotes the name of a user-defined aggregate) can be done according 
to the following schema: 

— Evaluate r(X, Y), and group the resulting values associated with Y into subsets 
$S_1, \ldots, S_n$ according to the different values associated with X. 

— For each subset $S_i = \{x_1, \ldots, x_n\}$, compute the aggregation value $r$ as follows: 

1. Evaluate single(aggr, $x_1$, C). Let $c$ be the resulting value associated with C. 

2. Evaluate multi(aggr, $x_i$, $c$, C), for each $i$ (by updating $c$ to the resulting 
value associated with C). 

3. Evaluate iterate(aggr, $c$, C). If the evaluation is successful, then update 
$c$ to the resulting value associated with C, and return to step 2. Otherwise, 
evaluate freturn(aggr, $c$, R) and return the resulting value $r$ associated 
with R. 

The following example shows how iterative aggregates can be defined in the 
datalog++ framework. Further details on the approach can be found in [16,5]. 

Example 1. By exploiting the iterate predicate, the abserr aggregate for 
computing the absolute deviation $S_n$ can be defined as follows: 

single(abserr, X, (avg, X, 1)). 
multi(abserr, (avg, S, C), X, (avg, S + X, C + 1)). 
multi(abserr, (M, D), X, (M, D + (M - X))) ← M > X. 
multi(abserr, (M, D), X, (M, D + (X - M))) ← M ≤ X. 
iterate(abserr, (avg, S, C), (S/C, 0)). 
freturn(abserr, (M, D), D). 

The first two clauses compute the average of the tuples under examination, 
in a way similar to the computation of the avg aggregate. The remaining clauses 
assume that the average value has already been computed, and are mainly used 
in the incremental computation of the sum of the absolute difference with the 
average. Notice how the combined use of multi and iterate allows the definition 
of two scans over the data. □ 
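
The evaluation schema and the abserr clauses can be mimicked in Python as follows. This is a sketch under assumptions rather than the datalog++ semantics itself: evaluate_iterative and the four helper functions are hypothetical names, a failing iterate is modelled by returning None, and we assume that the first scan applies multi to the tuples after the first one while every later scan covers all tuples.

```python
def evaluate_iterative(single, multi, iterate, freturn, values):
    """Scan the data, then rescan for as long as `iterate` succeeds."""
    state = single(values[0])          # step 1: base step on the first tuple
    rest = values[1:]
    while True:
        for x in rest:                 # step 2: inductive step
            state = multi(state, x)
        nxt = iterate(state)           # step 3: try to start another scan
        if nxt is None:                # iterate fails: return the result
            return freturn(state)
        state, rest = nxt, values      # later scans cover all the tuples

# abserr: the state is ("avg", S, C) in the first scan, (M, D) in the second.
def single(x):
    return ("avg", x, 1)

def multi(state, x):
    if len(state) == 3:                # still accumulating sum and count
        _, s, c = state
        return ("avg", s + x, c + 1)
    m, d = state                       # accumulate |M - X|
    return (m, d + abs(m - x))

def iterate(state):
    if len(state) == 3:                # first scan done: switch to (M, 0)
        _, s, c = state
        return (s / c, 0)
    return None                        # second scan done: stop

def freturn(state):
    return state[1]

deviation = evaluate_iterative(single, multi, iterate, freturn, [2, 4, 9])
```

For the values 2, 4 and 9 the first scan yields the average 5, and the second scan sums 3 + 1 + 4.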



3 Logic-Based Inductive Databases 

A suitable conceptual model that summarizes the relevant aspects discussed in 
section 1 is the notion of inductive database [1,17], which is a first attempt to 
formalize the notion of interactive mining process. In the following definition, 
proposed by Mannila [18,17], the term inductive database refers to a relational 
database plus the set of all sentences from a specified class of sentences that are 
true of the data. 

Definition 1. Given an instance $r$ of a relation $R$, a class $\mathcal{L}$ of sentences 
(patterns), and a selection predicate $q$, a pattern discovery task is to find a theory 

$Th(\mathcal{L}, r, q) = \{ s \in \mathcal{L} \mid q(r, s) \text{ is true} \}$ 

□ 

The main idea here is to provide a unified and transparent view of both 
deductive knowledge and all the derived patterns (the induced knowledge) over 
the data. The user does not care about whether he/she is dealing with inferred 
or induced knowledge, and whether the requested knowledge is materialized 
or not. The only detail he/she is interested in is the high-level specification 
of the query involving both deductive and inductive knowledge, according to 
some interestingness quality measure (which in turn can be either objective or 
subjective). 

The notion of Inductive Database fits naturally in rule-based languages, such 
as Deductive Databases [7,5]. A deductive database can easily represent both 
extensional and intensional data, thus allowing a higher degree of expressiveness 
than traditional relational algebra. Such capability makes it viable for suitable 
representation of domain knowledge and support of the various steps of the KDD 
process. 

The main problem in a deductive approach is how to choose a suitable rep- 
resentation formalism for the inductive part, enabling a tight integration with 
the deductive part. More specifically, the problem is how to formalize the specification 
of the set $\mathcal{L}$ of patterns in a way such that each pattern $s \in Th(\mathcal{L}, r, q)$ 
is represented as an independent (logical) entity (i.e., a predicate) and each 
manipulation of $r$ results in a corresponding change in $s$. To cope with such a 
problem, we introduce the notion of inductive clauses, i.e., clauses that formalize 
the dependency between the inductive and the deductive part of an inductive 
database. 

Definition 2. Given an inductive database theory $Th(\mathcal{L}, r, q)$, an inductive 
clause for the theory is a clause (denoted by $s$) 

$H \leftarrow B_1, \ldots, B_n$ 

such that 

— the evaluation of $B_1, \ldots, B_n$ in the computed stable model¹ $M_{s \cup r}$ corresponds 
to the extension $r$; 

— there exists an injective function $\phi$ mapping each ground instance $p$ of $H$ in $\mathcal{L}$; 

— $Th(\mathcal{L}, r, q)$ corresponds to the model $M_{s \cup r}$, i.e., 

$p \in M_{s \cup r} \iff \phi(p) \in Th(\mathcal{L}, r, q)$ 

□ 

As a consequence of the above definition, we can formalize the notion of 
logic-based knowledge discovery support environment, as a deductive database 
programming language capable of expressing both inductive clauses and deduc- 
tive clauses. 

Definition 3. A logic-based knowledge discovery support environment is a deductive 
database language capable of specifying: 

— relational extensions; 

— intensional predicates, by means of deductive clauses; 

— inductive predicates, by means of inductive clauses. 

□ 

¹ See [24,6] for further details. 

The main idea of the previous definition is that of providing a simple way for 
modeling the key aspects of a data mining query language: 

— the source data is represented by the relational extensions; 

— intensional predicates provide a way of dealing with background knowledge; 

— inductive predicates provide a representation of both the extracted knowl- 
edge and the interestingness measures. 

In order to formalize the notion of inductive clauses, the first fact that is 
worth observing is that it is particularly easy to deal with data mining tasks in 
a deductive framework, if we use aggregates as a basic tool. 

Example 2. A frequent itemset $\{I_1, I_2\}$ is a database pattern with a validity 
specified by the estimation of the posterior probability $\Pr(I_1, I_2 \mid r)$ (i.e., the 
probability that items $I_1$ and $I_2$ appear together according to $r$). Such a probability 
can be estimated by means of iceberg queries: an iceberg query is a query containing 
an aggregate, in which a constraint (typically a threshold constraint) 
over the aggregate is specified. For example, the following query 

SELECT R1.Item, R2.Item, COUNT(R1.Tid) 
FROM r R1, r R2 
WHERE R1.Tid = R2.Tid 
AND R1.Item <> R2.Item 
GROUP BY R1.Item, R2.Item 
HAVING COUNT(R1.Tid) > thresh 

computes all the pairs of items appearing in a database of transactions with a 
given frequency. The above query has a straightforward counterpart in datalog++. 
The following clauses define typical (two-dimensional) association rules by using 
the count aggregate. 

pair(I1, I2, count(T)) ← basket(T, I1), basket(T, I2), I1 < I2. 
rules(I1, I2) ← pair(I1, I2, C), C ≥ 2. 

The first clause generates and counts all the possible pairs, and the second 
one selects the pairs with sufficient support (i.e., at least 2 ). As a result, the 
predicate rules specifies associations, i.e. rules stating that certain combinations 
of values occur with other combinations of values with a certain frequency. Given 
the following definitions of the basket relation, 

basket(1, fish).   basket(2, bread).   basket(3, bread). 
basket(1, bread).  basket(2, milk).    basket(3, orange). 
                   basket(2, onions).  basket(3, milk). 
                   basket(2, fish). 

by querying rules(X, Y) we obtain predicates that model the corresponding 
inductive instances: rules(bread, milk) and rules(bread, fish). ◁ 
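
The two clauses of Example 2 can be replayed in Python on the same basket facts. This is a sketch for checking the result by hand, not a rendering of how datalog++ evaluates the clauses; the dictionary layout of basket and the variable names are assumptions.

```python
from collections import Counter
from itertools import combinations

# The basket relation of Example 2, grouped by transaction identifier.
basket = {
    1: {"fish", "bread"},
    2: {"bread", "milk", "onions", "fish"},
    3: {"bread", "orange", "milk"},
}

# pair(I1, I2, count(T)) with I1 < I2: count the transactions for each pair.
pair = Counter()
for items in basket.values():
    for i1, i2 in combinations(sorted(items), 2):
        pair[(i1, i2)] += 1

# rules(I1, I2): the pairs supported by at least 2 transactions.
rules = sorted(p for p, c in pair.items() if c >= 2)
```

Sorting each transaction before pairing plays the role of the I1 < I2 condition, so that every pair is generated once.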
Example 3. A classifier is a model that describes a discrete attribute, called the 
class, in terms of other attributes. A classifier is built from a set of objects (the 
training set) whose class values are known. Practically, starting from a table 
$r$, we aim at computing the probability $\Pr(C = c \mid A = a, r)$ for each pair $a, c$ 
of the attributes $A$ and $C$ [10]. This probability can be roughly estimated by 
computing some statistics over the data, such as, e.g.: 

SELECT A, C, COUNT(*) 
FROM r 
GROUP BY A, C 

Again, the datalog++ language easily allows the specification of such statis- 
tics. Let us consider for example the playTennis(Out, Temp, Hum, Wind, Play) ta- 
ble. We would like to predict the probability of playing tennis, given the values 
of the other attributes. The following clauses specify the computation of the 
necessary statistics for each attribute of the playTennis relation: 

statistics_Out(O, P, count(*)) ← playTennis(O, T, H, W, P). 
statistics_Temp(T, P, count(*)) ← playTennis(O, T, H, W, P). 
statistics_Hum(H, P, count(*)) ← playTennis(O, T, H, W, P). 
statistics_Wind(W, P, count(*)) ← playTennis(O, T, H, W, P). 
statistics_Play(P, count(*)) ← playTennis(O, T, H, W, P). 

The results of the evaluation of such clauses can be easily combined in order 
to obtain the desired classifier. ◁ 
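
The statistics of Example 3 and one way of turning them into a rough estimate of $\Pr(C = c \mid A = a, r)$ can be sketched in Python as follows. The rows of play_tennis are hypothetical and serve only as an illustration, and the estimate function is an assumption about how the counts might be combined, not the classifier construction of [10].

```python
from collections import Counter

# Hypothetical rows of playTennis(Out, Temp, Hum, Wind, Play).
play_tennis = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
]
ATTRS = ("Out", "Temp", "Hum", "Wind")

# One counter per attribute, keyed by (attribute value, class value),
# mirroring the statistics clauses for Out, Temp, Hum and Wind.
statistics = {name: Counter((row[i], row[-1]) for row in play_tennis)
              for i, name in enumerate(ATTRS)}
# Class frequencies, mirroring the statistics clause for Play.
statistics_play = Counter(row[-1] for row in play_tennis)

def estimate(attr, value, cls):
    """Rough estimate of Pr(Play = cls | attr = value) from the counts."""
    total = sum(n for (v, _), n in statistics[attr].items() if v == value)
    return statistics[attr][(value, cls)] / total if total else 0.0

p = estimate("Out", "sunny", "yes")
```

Each counter is the Python counterpart of one GROUP BY A, C query of the kind shown above.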

The above examples show how the simple clauses specifying aggregates can 
be devised as inductive clauses. Clauses containing aggregates, in languages such 
as datalog++, satisfy the most desirable property of inductive clauses, i.e., the 
capability to specify patterns of $\mathcal{L}$ that hold in $Th$ in a “parameterized” way, 
i.e., according to the tuples of an extension $r$. In this paper, we use aggregates 
as the means to introduce mining primitives into the query language. 

As a matter of fact, an aggregate is a “natural” definition of an inductive 
database schema, in which patterns correspond to the true facts in the computed 
stable model, as the following statement shows. 

Lemma 1. An aggregate defines an inductive database. 

Proof. By construction. Let us consider the following clause (denoted by $r_p$): 

p(X_1, ..., X_n, aggr(Y_1, ..., Y_m)) ← r(X_1, ..., X_n, Y_1, ..., Y_m). 

We then define 

$\mathcal{L} = \{ \langle t_1, \ldots, t_n, s \rangle \mid p(t_1, \ldots, t_n, s) \text{ is ground} \}$ 

and 

$q(r, \langle t_1, \ldots, t_n, s \rangle) = \text{true}$ if and only if $p(t_1, \ldots, t_n, s) \in M_{r_p \cup r}$ 

that imposes that the only valid patterns are those belonging to the iterated 
stable model procedure. □ 
In such a context, an important issue to investigate is the correspondence 
between inductive schemas and aggregates, i.e., whether a generic inductive 
schema can be specified by means of an aggregate. Formally, given an inductive 
database $Th(\mathcal{L}, r, q)$, we are interested in providing a specification of an 
aggregate aggr, and in defining a clause 

q(Z_1, ..., Z_k, aggr(X_1, ..., X_n)) ← r(Y_1, ..., Y_m). 

which should turn out to be a viable inductive clause for $Th$. The clause should 
define the format of any valid pattern $s \in \mathcal{L}$. In particular, the correspondence of 
$s$ with the ground instances of the q predicate should be defined by the specification 
of the aggregate aggr (as described in section 2). Moreover, any predicate 
q(t_1, ..., t_k, s) resulting from the evaluation of such a clause and corresponding 
to $s$ should model the fact that $s \in Th(\mathcal{L}, r, q)$. When such a definition is 
possible, the “inductive” predicate q itself can be used in the definition of more 
complex queries. 

Relating the specification of aggregates with inductive clauses is particularly 
attractive for two main reasons. First of all, it provides an amalgamation between 
mining and querying, and hence makes it easy to provide a unique interface 
capable of specifying source data, knowledge extraction, background knowledge 
and interestingness specification. Moreover, it allows a good flexibility in the 
exploitation of mining algorithms for specific tasks. Indeed, we can implement
aggregates as simple language interfaces for the algorithms (implemented as
separate modules, as in [7]); conversely, we can exploit the notion of iterative
aggregate, and explicitly specify the algorithms in datalog++ (as in [8]). The
latter approach, in particular, gives two further advantages:

— from a conceptual point of view, it allows the use of background knowledge 
directly in the exploration of the search space. 

— from an efficiency point of view, it provides the opportunity of integrating 
specific optimizations inside the algorithm. 



4 Mining Aggregates 

As stated in the previous section, our main aim is the definition of inductive 
clauses, formally modeled as clauses containing specific user-defined aggregates, 
and representing specific data mining tasks. The discussion on how to provide 
efficient implementations of algorithms for such tasks by means of iterative
aggregates is given in [8]. The rest of the paper is devoted to showing how the
proposed model is suitable for some important knowledge discovery tasks. We
shall formalize some inductive clauses (by means of aggregates) to formulate
data mining tasks in the datalog++ framework. In the resulting framework, the
integration of inductive clauses with deductive clauses allows a simple and in- 
tuitive formalization of the various steps of the data mining process, in which 
deductive rules can specify both the preprocessing and the result evaluation 
phase, while inductive rules can specify the mining phase. 

4.1 Frequent Patterns Discovery

Associations are rules that state that certain combinations of values occur with 
other combinations of values with a certain frequency and certainty. A general 
definition is the following. 

Let $\mathcal{I} = \{a_1, \ldots, a_n\}$ be a set of literals, called items. An
itemset $T$ is a set of items such that $T \subseteq \mathcal{I}$. Given a
relation $R = A_1 \ldots A_n$, a transaction in an instance $r$ of $R$,
associated to attribute $A_i$ (where $dom(A_i) = \mathcal{I}$) according to
a transaction identifier $A_j$, is the set of items of the tuples of $r$ having
the same value of $A_j$.

An association rule is a statement of the form $X \Rightarrow Y$, where
$X \subseteq \mathcal{I}$ and $Y \subseteq \mathcal{I}$ are two sets of items.
To an association rule we can associate some statistical parameters. The
support of a rule is the percentage of transactions that contain the set
$X \cup Y$, and the confidence is the percentage of transactions that contain
$Y$, provided that they contain $X$.

The problem of association rules mining can be finally stated as follows:
given an instance $r$ of $R$, find all the association rules from the set of
transactions associated to $A_i$ (grouped according to $A_j$), such that for
each rule $A \Rightarrow B\ [S, C]$, $S \geq \sigma$ and $C \geq \gamma$, where
$\sigma$ is the support threshold and $\gamma$ is the confidence threshold.
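To make the two measures concrete, the following is a minimal Python sketch of support, confidence and the threshold test stated above. The function names and the toy transactions are our own illustrative assumptions; they are not part of the datalog++ framework.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(itemset: FrozenSet[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs: FrozenSet[str], rhs: FrozenSet[str],
               transactions: List[Transaction]) -> float:
    """Fraction of the transactions containing `lhs` that also contain `rhs`."""
    base = support(lhs, transactions)
    if base == 0.0:
        return 0.0
    return support(lhs | rhs, transactions) / base


def satisfies(lhs: FrozenSet[str], rhs: FrozenSet[str],
              transactions: List[Transaction],
              sigma: float, gamma: float) -> bool:
    """True when the rule lhs => rhs meets both thresholds."""
    return (support(lhs | rhs, transactions) >= sigma
            and confidence(lhs, rhs, transactions) >= gamma)


# Hypothetical toy transactions (illustrative only).
transactions = [
    frozenset({"diapers", "beer", "milk"}),
    frozenset({"diapers", "beer"}),
    frozenset({"milk", "bread"}),
    frozenset({"bread"}),
]

diapers, beer = frozenset({"diapers"}), frozenset({"beer"})
print(support(diapers | beer, transactions))
print(confidence(diapers, beer, transactions))
```

On this toy data the rule diapers ⇒ beer is contained in two of the four transactions, and every transaction containing diapers also contains beer.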

The following definition provides a formulation of the association rules mining 
task in terms of inductive databases. 

Definition 4. Let $r$ be an instance of the table $R = A_1 \ldots A_n$, and
$\sigma, \gamma \in [0, 1]$. For given $i, j \leq n$, let

— $\mathcal{L} = \{A \Rightarrow B \mid A, B \subseteq dom(R[A_i])\}$, and

— $q(r, A \Rightarrow B) = \text{true}$ if and only if
$freq(A \cup B, r) \geq \sigma$ and $freq(A \cup B, r) / freq(A, r) \geq \gamma$,

where $freq(s, r)$ is the (relative) frequency of $s$ in the set of the
transactions in $r$ grouped by $A_j$. The theory $Th(\mathcal{L}, r, q)$ defines
the frequent patterns discovery task. □

The above definition provides an inductive schema for the frequent pattern 
discovery task. We now specify a corresponding inductive clause. 

Definition 5. Given a relation $r$, the patterns aggregate is defined by the rule

$$p(X_1, \ldots, X_n, \mathtt{patterns}\langle (m\_s, m\_c, Y) \rangle) \leftarrow r(Z_1, \ldots, Z_m) \qquad (2)$$

where the variables $X_1, \ldots, X_n, Y$ are a rearranged subset of the variables
$Z_1, \ldots, Z_m$ of $r$, and the $Y$ variable denotes a set of elements. The
aggregate patterns computes the set of predicates
$p(t_1, \ldots, t_n, l, r, f, c)$ where:

1. $t_1, \ldots, t_n$ are distinct instances of the variables $X_1, \ldots, X_n$, as
resulting from the evaluation of $r$;

2. $l = \{l_1, \ldots, l_k\}$ and $r = \{r_1, \ldots, r_h\}$ are subsets of the value
of $Y$ in a tuple resulting from the evaluation of $r$;

3. $f$ and $c$ are respectively the support and the confidence of the rule
$l \Rightarrow r$, such that $f \geq m\_s$ and $c \geq m\_c$.

□

Example 4. Let us consider a sample transaction table. The following program
specifies the computation of association rules (with 30% support threshold and
100% confidence threshold) starting from the extension of the table:

transaction(D, C, ⟨I⟩) ← transaction(D, C, I, Q, P).

rules(patterns⟨(0.3, 1.0, S)⟩) ← transaction(D, C, S).

The first clause collects all the transactions associated to attribute I and
grouped by attributes D and C. The second clause extracts the relevant patterns
from the collection of available transactions. The result of the evaluation of
the predicate rules(L, R, S, C) against such a program yields, e.g., the answer
predicate rules(diapers, beer, 0.5, 1.0). ◁
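The behaviour of the two clauses can be imitated outside the logic language. The sketch below is a brute-force Python illustration and not the implementation referred to in [8,16]: it first groups items by (D, C), as the first clause does, and then enumerates every rule meeting the two thresholds. The rows are hypothetical and omit the Q and P attributes; `mine_rules` is a name we introduce for illustration.

```python
from collections import defaultdict
from itertools import combinations


def mine_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for every rule lhs => rhs
    meeting both thresholds. Exhaustive enumeration: toy data only."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def freq(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            whole = frozenset(combo)
            s = freq(whole)
            if s == 0.0 or s < min_support:
                continue
            # Split the frequent itemset into a non-empty lhs and rhs.
            for k in range(1, size):
                for lhs_items in combinations(combo, k):
                    lhs = frozenset(lhs_items)
                    c = s / freq(lhs)
                    if c >= min_confidence:
                        rules.append((lhs, whole - lhs, s, c))
    return rules


# Hypothetical (date, customer, item) rows.
rows = [
    ("d1", "c1", "diapers"), ("d1", "c1", "beer"), ("d1", "c1", "milk"),
    ("d1", "c2", "diapers"), ("d1", "c2", "beer"),
    ("d2", "c1", "milk"), ("d2", "c1", "bread"),
    ("d2", "c3", "bread"),
]

# First clause: collect the items of each (D, C) group into one set.
baskets = defaultdict(set)
for d, c, item in rows:
    baskets[(d, c)].add(item)
transactions = [frozenset(b) for b in baskets.values()]

# Second clause: extract the rules with 30% support and 100% confidence.
rules = mine_rules(transactions, 0.3, 1.0)
for lhs, rhs, s, c in rules:
    print(sorted(lhs), "=>", sorted(rhs), s, c)
```

A practical implementation would prune the search space (for instance level-wise, as in Apriori) rather than enumerate all itemsets.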

It is easy to see how inductive clauses exploiting the patterns aggregate 
allow the specification of frequent pattern discovery tasks. The evidence of such 
a correspondence can be obtained by suitably specifying the patterns aggre- 
gate [8,16]. 



4.2 (Bayesian) Classification 

It is particularly simple to specify Naive Bayes classification by means of an
inductive database schema. Let us consider a relation $R$ with attributes
$A_1, \ldots, A_n$ and $C$. For simplicity, we shall assume that all the
attributes represent discrete values. This is not a major problem, since

— it is well-known that classification algorithms perform better with discrete- 
valued attributes; 

— supervised discretization [15] combined with discrete classification allows a 
more effective approach. As we shall see, the framework based on inductive 
clauses easily allows the specification of discretization tasks as well. 

The Bayesian classification task can be summarized as follows. Given an
instance $r$ of $R$, we aim at computing the function

$$\max_{c} \Pr(C = c \mid A_1 = a_1, \ldots, A_n = a_n, r) \qquad (4)$$

where $c \in dom(C)$ and $a_i \in dom(A_i)$. By repeated application of Bayes' rule
and the assumption that $A_1, \ldots, A_n$ are independent, we obtain

$$\Pr(C = c \mid A_1 = a_1, \ldots, A_n = a_n, r) = \Pr(C = c \mid r) \times \prod_{i} \Pr(A_i = a_i \mid C = c, r)$$

Now, each factor in the above product can be estimated from $r$ by means of
the following equation

$$\Pr(A_i = a_i \mid C = c, r) = \frac{freq(c, \sigma_{A_i = a_i}(r))}{freq(c, r)}$$
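The estimate can be sketched in a few lines of Python. This is an illustrative reading of the formulas above using raw counts and no smoothing; the function names and the toy rows are assumptions made for the example.

```python
from collections import Counter


def naive_bayes_scores(rows, query):
    """rows: list of (attribute_tuple, class_label); query: attribute_tuple.
    Returns {c: Pr(C=c | r) * prod_i Pr(A_i=a_i | C=c, r)} from raw counts."""
    class_count = Counter(label for _, label in rows)
    total = len(rows)
    scores = {}
    for c, n_c in class_count.items():
        score = n_c / total  # Pr(C = c | r)
        for i, a_i in enumerate(query):
            # freq(c, sigma_{A_i = a_i}(r)): tuples of class c with A_i = a_i
            joint = sum(1 for attrs, label in rows
                        if label == c and attrs[i] == a_i)
            score *= joint / n_c  # divided by freq(c, r)
        scores[c] = score
    return scores


def classify(rows, query):
    """Return the class with the highest score for `query`."""
    scores = naive_bayes_scores(rows, query)
    return max(scores, key=scores.get)


# Hypothetical toy relation: (outlook, wind) -> play.
rows = [
    (("sunny", "weak"), "no"),
    (("sunny", "strong"), "no"),
    (("overcast", "weak"), "yes"),
    (("rain", "weak"), "yes"),
    (("rain", "strong"), "no"),
    (("sunny", "weak"), "yes"),
]

scores = naive_bayes_scores(rows, ("sunny", "weak"))
print(scores, classify(rows, ("sunny", "weak")))
```

Because no smoothing is applied, a value never observed together with a class drives the score of that class to zero.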
Hence, the definition of a classification task can be accomplished by computing
some suitable statistics. That is, a suitable inductive theory associates to each
possible pair the corresponding statistic.

Definition 6. Let $R = A_1 \ldots A_n C$ be a relation schema. Given an instance $r$
of $R$, we define

— $\mathcal{L} = \{\langle A_i = a_i \wedge C = c, n_A, n_C \rangle \mid a_i \in dom(A_i), c \in dom(C)$ and
$n_A, n_C \in \mathbb{R}\}$.

— $q(r, \langle A_i = a_i \wedge C = c, n_A, n_C \rangle) = \text{true}$ if and only if
$n_A = \Pr(A_i = a_i \mid C = c, r)$ and $n_C = \Pr(C = c \mid r)$.

The resulting theory $Th(\mathcal{L}, r, q)$ formalizes a naive Bayesian classification task.

□

Notice that the datalog++ language easily allows the computation of all the 
needed statistics, by enumerating all the pairs for which we need to count the 
occurrences (see, e.g., example 3). However, we can associate the inductive theory 
with a more powerful user-defined aggregate, in which all the needed statistics 
can be efficiently computed without resorting to multiple clauses. 

Definition 7. Given a relation $r$, the nbayes aggregate is defined by a rule
schema

$$s(X_1, \ldots, X_m, \mathtt{nbayes}\langle (\{(1, A_1), \ldots, (n, A_n)\}, C) \rangle) \leftarrow r(Z_1, \ldots, Z_k).$$

where

— The variables $X_1, \ldots, X_m, A_1, \ldots, A_n, C$ are a (possibly rearranged)
subset of the values of $Z_1, \ldots, Z_k$ resulting from the evaluation of $r$;

— The result of such an evaluation is a predicate
$s(t_1, \ldots, t_m, c, (i, a_i), v_i, v_c)$, representing the set of counts of all
the possible values $a_i$ of the $i$-th attribute $A_i$, given any possible value
$c$ of $C$. In particular, $v_i$ represents the frequency of the pair $a_i, c$,
and $v_c$ represents the frequency of $c$ in the extension $r$.

□

Example 5. Let us consider the toy playTennis table defined in example 3. The
frequencies of the attributes can be obtained by means of the clause

classifier(nbayes⟨({(1, O), (2, T), (3, H), (4, W)}, P)⟩) ← playTennis(O, T, H, W, P).

The evaluation of the query classifier(C, F, Cp, Cc) returns, e.g., the answer
predicate classifier(yes, (1, sunny), 0.6, 0.4). ◁

Again, by suitably specifying the aggregate, we obtain a correspondence be- 
tween the inductive database schema and its deductive counterpart [16,5]. 

4.3 Clustering 

Clustering is perhaps the most straightforward example of data mining task
providing a suitable representation of its results in terms of relational
tables. In a relation $R$ with instance $r$, the main objective of clustering is
that of labeling each tuple $\mu \in r$. In relational terms, this corresponds
to adding a set of attributes $A_1, \ldots, A_n$ to $R$, so that a tuple
$(a_1, \ldots, a_n)$ associated to a tuple $\mu \in r$ represents a cluster
assignment for $\mu$. For example, we can enhance $R$ with two attributes
$C$ and $M$, where $C$ denotes the cluster identifier and $M$ denotes a probability
measure. A tuple $\mu \in r$ is represented in the enhancement of $R$ by a new tuple
$\mu'$, where $\mu'[A] = \mu[A]$ for each $A \in R$, and $\mu'[C]$ represents the
cluster to which $\mu$ belongs, with probability $\mu'[M]$. In the following
definition, we provide a sample inductive schema formalizing the clustering task.

Definition 8. Given a relation $R = A_1 \ldots A_n$ with extension $r$, such that tuples
in $r$ can be organized in $k$ clusters, an inductive database modeling clustering is
defined by $Th(\mathcal{L}, r, q)$, where

— $\mathcal{L} = \{\langle \mu, i \rangle \mid \mu \in dom(A_1) \times \ldots \times dom(A_n), i \in \mathbb{N}\}$, and

— $q(r, \langle \mu, i \rangle)$ is true if and only if $\mu \in r$ is assigned to the $i$-th cluster.

□

It is particularly intuitive to specify the clustering data mining task as an 
aggregate. 

Definition 9. Given a relation $r$, the clusters aggregate is defined by the rule
schema

$$p(X_1, \ldots, X_n, \mathtt{clusters}\langle (Y_1, \ldots, Y_k) \rangle) \leftarrow r(Z_1, \ldots, Z_m).$$

where the variables $X_1, \ldots, X_n, Y_1, \ldots, Y_k$ are a rearranged subset of
the variables $Z_1, \ldots, Z_m$ of $r$. The aggregate clusters computes the set of
predicates $p(t_1, \ldots, t_n, s_1, \ldots, s_k, c)$, where:

1. $t_1, \ldots, t_n, s_1, \ldots, s_k$ are distinct instances of the variables
$X_1, \ldots, X_n, Y_1, \ldots, Y_k$, as resulting from the evaluation of $r$;

2. $c$ is a label representing the cluster to which the tuple $s_1, \ldots, s_k$ is
assigned, according to some clustering algorithm.

□

Example 6. Consider a relation customer(name, address, age, income), storing
information about customers, as shown in fig. 1 a). We can define a “clustered”
view of such a database by means of the following rule:

custCView(clusters⟨(N, AD, AG, I)⟩) ← customer(N, AD, AG, I).

The evaluation of such rule, shown in fig. 1 b), produces two clusters. ◁

It is particularly significant to see how the proposed approach allows one to
directly model closure (i.e., to manipulate the mining results).

Example 7. We can easily exploit the patterns aggregate to find an explanation
of such clusters:

frqPat(C, patterns⟨(0.6, 1.0, {f_a(AD), s_a(I)})⟩) ← custCView(N, AD, AG, I, C).
Fig. 1. a) Sample customer table. Each row represents some relevant features
of a customer. b) Cluster assignments: a cluster label is associated to each row
in the original customer table. c) Cluster explanations with frequent patterns.
The first cluster contains customers with high income and living in Pisa, while
the second cluster contains customers with low income and living in Rome

Notice how the cluster label C is used for separating transactions belonging
to different clusters, and mining associations in such clusters separately. Fig. 1
c) shows the patterns resulting from the evaluation of such a rule. As we can see,
cluster 1 is mainly composed of high-income people living in Pisa, while cluster
2 is mainly composed of low-income people living in Rome. ◁
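The closure step, in which the output of one mining task feeds another, can be mimicked procedurally. The Python sketch below is a simplified illustration under our own assumptions: the clustered view is given as hypothetical rows, income is already expressed as a level, and only rules of the form address ⇒ income are tested within each cluster.

```python
from collections import defaultdict


def explain_clusters(view, min_support, min_confidence):
    """view: iterable of (name, address, age, income_level, cluster).
    For each cluster, report the rules address => income holding within it
    as tuples (address, income, support, confidence)."""
    by_cluster = defaultdict(list)
    for _name, address, _age, income, cluster in view:
        by_cluster[cluster].append((address, income))

    result = {}
    for cluster, pairs in by_cluster.items():
        n = len(pairs)
        found = []
        for address, income in sorted(set(pairs)):
            both = sum(1 for p in pairs if p == (address, income))
            lhs = sum(1 for a, _ in pairs if a == address)
            if both / n >= min_support and both / lhs >= min_confidence:
                found.append((address, income, both / n, both / lhs))
        result[cluster] = found
    return result


# Hypothetical clustered view (not the data of fig. 1).
view = [
    ("ann", "pisa", 35, "high", 1),
    ("bob", "pisa", 41, "high", 1),
    ("cid", "pisa", 29, "high", 1),
    ("dan", "rome", 52, "low", 2),
    ("eve", "rome", 47, "low", 2),
]

explanations = explain_clusters(view, 0.6, 1.0)
print(explanations)
```

Grouping by the cluster label before counting plays the role of the variable C in the head of the frqPat clause.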

4.4 Data Discretization 

Data discretization can be used to reduce the number of values for a given con- 
tinuous attribute, by dividing the range of the attribute into intervals. Interval 
labels can be used to replace actual data values. 

Given an instance $r$ of a relation $R$ with a numeric attribute $A$, the main
idea is to provide a mapping among the values of $dom(A)$ and some given labels.
More precisely, we define $\mathcal{L}$ as the pairs $\langle a, i \rangle$, where
$a \in dom(A)$ and $i \in V$ represents an interval label (i.e., $V$ is a
representation of all the intervals $[l, s]$ such that $l, s \in dom(A)$). A
discretization task of the tuples of $r$, formalized as a theory
$Th(\mathcal{L}, r, q)$, can be defined according to some discretization objective,
i.e., a way of relating a value $a \in dom(A)$ to an interval label $i$.

In such a context, discretization techniques can be divided into supervised
and unsupervised ones [12,3]. The objective of supervised methods is to discretize
continuous values in homogeneous intervals, i.e., intervals that preserve a
predefined property (which, in practice, is represented by a label associated to
each continuous value). The formalization of supervised discretization as an
inductive theory can be tuned as follows.

Definition 10. Let $r$ be an instance of a relation $R$. Given an attribute $A \in R$,
and a discrete-valued attribute $C \in R$, an inductive database theory
$Th(\mathcal{L}, r, q)$ defines a supervised discretization task if

— either $dom(A) \subseteq \mathbb{N}$ or $dom(A) \subseteq \mathbb{R}$;

— $\mathcal{L} = \{\langle a, C, i \rangle \mid a \in dom(A), i \in \mathbb{N}\}$;

— $q(r, \langle a, C, i \rangle) = \text{true}$ if and only if there is a discretization
of $\pi_A(r)$ in homogeneous intervals with respect to the attribute $C$, such
that $a$ belongs to the $i$-th interval.

□

The following definition provides a corresponding inductive clause.

Definition 11. Given a relation $r$, the aggregate discr defines the rule schema

$$s(Y_1, \ldots, Y_k, \mathtt{discr}\langle (Y, C) \rangle) \leftarrow r(X_1, X_2, \ldots, X_n).$$

where

— $Y$ is a continuous-valued variable, and $C$ is a discrete-valued variable;

— $Y_1, \ldots, Y_k, Y, C$ are a rearranged subset of the variables $X_1, X_2, \ldots, X_n$ in $r$.

The result of the evaluation of such rule is given by predicates
$s(t_1, \ldots, t_k, v, i)$, where $t_1, \ldots, t_k, v$ are distinct instances of
the variables $Y_1, \ldots, Y_k, Y$, and $i$ is an integer value representing the
$i$-th interval in a supervised discretization of the values of $Y$ labelled with
the values of $C$. □

In the ChiMerge approach to supervised discretization [15], a bottom-up
interval generation procedure is adopted: initially, each single value is considered
an interval. At each iteration, two adjacent intervals are chosen and joined,
provided that they are sufficiently homogeneous. The degree of homogeneity is
measured w.r.t. a class label, and is computed by means of a $\chi^2$ statistic.
Homogeneity is made parametric to a user-defined significance level $\alpha$,
identifying the probability that two adjacent intervals have independent label
distributions. In terms of inductive clauses, this can be formalized as follows:

$$s(Y_1, \ldots, Y_k, \mathtt{discr}\langle (Y, C, \alpha) \rangle) \leftarrow r(X_1, X_2, \ldots, X_n).$$

Here, $Y$ represents the attribute to discretize, $C$ represents the class label and
$\alpha$ is the significance level.
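The bottom-up procedure can be sketched in Python as follows. This is a simplified illustration and not the implementation behind the discr aggregate: the merge threshold is passed directly (in practice it would be derived from $\alpha$ and the number of classes through the $\chi^2$ distribution), cells with zero expected count are skipped, and the sample data are hypothetical.

```python
from collections import Counter


def chi2(a, b, classes):
    """Pearson chi-square statistic for the class counts of two intervals."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for counts in (a, b):
        row = sum(counts.values())
        for c in classes:
            expected = row * (a[c] + b[c]) / total
            if expected > 0:  # simplification: skip empty columns
                stat += (counts[c] - expected) ** 2 / expected
    return stat


def chimerge(samples, threshold):
    """samples: list of (value, label). Returns closed intervals (low, high)
    over the observed values, merged bottom-up."""
    classes = sorted({label for _, label in samples})
    grouped = {}
    for value, label in samples:
        grouped.setdefault(value, Counter())[label] += 1
    # Initially, every distinct value is its own interval.
    intervals = [[v, v, grouped[v]] for v in sorted(grouped)]
    while len(intervals) > 1:
        stats = [chi2(intervals[i][2], intervals[i + 1][2], classes)
                 for i in range(len(intervals) - 1)]
        best = min(range(len(stats)), key=stats.__getitem__)
        if stats[best] >= threshold:
            break  # every adjacent pair differs significantly
        low, _, left = intervals[best]
        _, high, right = intervals[best + 1]
        intervals[best:best + 2] = [[low, high, left + right]]
    return [(low, high) for low, high, _ in intervals]


# Hypothetical (price, beer) pairs; 2.706 is the 0.90 quantile of the
# chi-square distribution with one degree of freedom (two classes).
samples = [(100, "Bud"), (117, "Bud"), (120, "Becks"),
           (150, "Becks"), (160, "Becks")]
print(chimerge(samples, 2.706))
```

With these samples the prices associated with the same beer end up in the same interval, while the boundary between the two beers is kept.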

Example 8. The following rule

intervals(discr⟨(Price, Beer, 0.9)⟩) ← serves(_, Beer, Price).

defines a 0.9 significance level supervised discretization of the price attribute,
according to the values of beer, of the relation serves shown in fig. 2. More
precisely, we aim at obtaining a discretization of price into intervals preserving
the values of the beer attribute (as they appear associated to price in the tuples
of the relation serves). For example, the values 100 and 117 can be merged into
the interval [100, 117], since the tuples of serves in which such values occur
contain the same value of the beer attribute (the Bud value). The results of the
evaluation of the inductive clause are shown in fig. 2 b). ◁
Fig. 2. a) The serves relation. Each row in the relation represents a brand of
beer served by a given bar, with the associated price. b) ChiMerge discretization:
each value of the price attribute in serves is associated with an interval.
Intervals contain values which occur in tuples of serves presenting similar values
of the beer attribute

It is particularly interesting to see how the discretization and classification 
tasks can be combined, by exploiting the respective logical formalizations. 

Example 9. A typical dataset used as a benchmark for classification tasks is the
Iris classification dataset [15]. The dataset can be represented by means of a
relation containing 5 attributes:

iris(Sepal_length, Sepal_width, Petal_length, Petal_width, Specie)

Each tuple in the dataset describes the relevant features of an iris flower. The
first four attributes are continuous-valued attributes, and the Specie attribute is
a nominal attribute corresponding to the class to which the flower belongs (either
iris-setosa, iris-versicolor, or iris-virginica). We would like to characterize species
in the iris relation according to their features. To this purpose, we may need a
preprocessing phase in which continuous attributes are discretized:

intervals_SL(discr⟨(SL, C, 0.9)⟩) ← iris(SL, SW, PL, PW, C).
intervals_SW(discr⟨(SW, C, 0.9)⟩) ← iris(SL, SW, PL, PW, C).
intervals_PL(discr⟨(PL, C, 0.9)⟩) ← iris(SL, SW, PL, PW, C).
intervals_PW(discr⟨(PW, C, 0.9)⟩) ← iris(SL, SW, PL, PW, C).

The predicates defined by the above clauses provide a mapping of the continuous
values to the intervals shown in fig. 3. A classification task can be defined
by exploiting such predicates:

irisCl(nbayes⟨({SL, SW, PL, PW}, C)⟩) ← iris(SL1, SW1, PL1, PW1, C),
    intervals_SW(SW1, SW),
    intervals_SL(SL1, SL),
    intervals_PW(PW1, PW),
    intervals_PL(PL1, PL).

Fig. 3. Bayesian statistics using ChiMerge and the nbayes aggregate. For each
attribute of the iris relation, a set of intervals is obtained, and the distribution
of the classes within such intervals is computed. For example, interval [4.3, 4.9) of
the sepal_length attribute contains 16 tuples labelled as setosa, while interval
[7.1, 7.9) of the same attribute contains 12 tuples labelled as versicolor

The statistics resulting from the evaluation of the predicate defined by the
above rule are shown in fig. 3. ◁

5 Conclusions 

The main purpose of flexible knowledge discovery systems is to obtain, maintain, 
represent, and utilize high-level knowledge. This includes representation and 
organization of domain and extracted knowledge, its creation through specialized 
algorithms, and its utilization for context recognition, disambiguation, and needs 
identification. Current knowledge discovery systems provide a fixed paradigm 
that does not sufficiently support such features in a coherent formalism. On the
contrary, logic-based database languages provide a flexible model of interaction
that actually supports most of the above features in a powerful, simple and 
versatile formalism. This motivated the study of a logic-based framework for 
intelligent data analysis. 

The main contribution of this paper was the development of a logic database 
language with elementary data mining mechanisms to model extraction, repre- 
sentation and utilization of both induced and deduced knowledge. In particular, 
we have shown that aggregates provide a standard interface for the specification 
of data mining tasks in the deductive environment: i.e., they allow modeling
mining tasks as operations unveiling pre-existing knowledge. We used such main 
features to model a set of data mining primitives: frequent pattern discovery, 
Bayesian classification, clustering and discretization. 

The main drawback of a deductive approach to data mining query languages
concerns efficiency: a data mining algorithm can benefit from substantial
optimizations that come both from a smart constraining of the search space, and
from the exploitation of efficient data structures. In this case, the adoption
of the datalog++ logic database language has the advantage of allowing a direct
specification of mining algorithms, thus allowing specific optimizations.
Practically, we can directly specify data mining algorithms by means of iterative
user-defined aggregates, and implement the most computationally intensive
operations by means of hot-spot refinements [8]. Such a feature makes it possible
to modularize data mining algorithms and integrate domain knowledge in the right
points, thus allowing crucial domain-oriented optimizations.

Abstract. Spatial data mining is a process used to discover interesting but not
explicitly available, highly usable patterns embedded in both spatial and non- 
spatial data, which are possibly stored in a spatial database. An important 
application of spatial data mining methods is the extraction of knowledge from 
a Geographic Information System (GIS). INGENS (INductive GEographic 
iNformation System) is a prototype GIS which integrates data mining tools to 
assist users in their task of topographic map interpretation. The spatial data 
mining process is aimed at a user who controls the parameters of the process by 
means of a query written in a mining query language. In this paper, we present 
SDMOQL (Spatial Data Mining Object Query Language), a spatial data mining 
query language used in INGENS, whose design is based on the standard OQL 
(Object Query Language). Currently, SDMOQL supports two data mining 
tasks: inducing classification rules and discovering association rules. For both
tasks the language permits the specification of the task-relevant data, the kind 
of knowledge to be mined, the background knowledge and the hierarchies, the 
interestingness measures and the visualization for discovered patterns. Some 
constraints on the query language are identified by the particular mining task. 
The syntax of the query language is described and the application to a real 
repository of maps is briefly reported. 



1 Introduction 

Spatial data are important in many applications, such as computer-aided design, 
image processing, VLSI, and GIS. The steady growth of spatial data is outpacing the
human ability to interpret them. There is a pressing need for new techniques and tools 
to find implicit regularities hidden in the spatial data. 

Advances in spatial data structures [7], spatial reasoning [3], and computational 
geometry [22] have paved the way for the study of knowledge discovery in spatial 
data, and, more specifically, in geo-referenced data. Spatial data mining methods 
have been proposed for the extraction of implicit knowledge, spatial relations, or 
other patterns not explicitly stored in spatial databases [15]. Generally speaking, a 
spatial pattern is a pattern showing the interaction between two or more spatial 
objects or space-dependent attributes, according to a particular spacing or set of 
arrangements [1]. 

Knowledge discovered from spatial data may include classification rules, which 
describe the partition of the database into a given set of classes [14], clusters of spatial 
objects ([11], [24]), patterns describing spatial trends, that is, regular changes of one 
or more non-spatial attributes when moving away from a given start object [6], and 
subgroup patterns, which identify subgroups of spatial objects with an unusual, an
unexpected, or a deviating distribution of a target variable [13]. The problem of 
mining spatial association rules has been tackled by [14], who implemented the 
module Geo-associator of the spatial data mining system GeoMiner [9]. 

A database perspective on spatial data mining is given in the work by Ester et al. 
[6], who define a small set of database primitives for the manipulation of 
neighbourhood graphs and paths used in some spatial data mining systems. An 
Inductive Logic Programming (ILP) perspective on spatial data mining is reported in 
[21], which proposes a logical framework for spatial association rule mining. 

GIS offers an important application area where spatial data mining techniques can 
be effectively used. In the work by Malerba et al. [20], it can be seen how some 
classification patterns, induced from georeferenced data, can be used in topographic 
map interpretation tasks. A prototype of GIS, named INGENS [19], has been built 
around this application. In INGENS the geographical data collection is organized 
according to an object-oriented data model and is stored in a commercial Object 
Oriented DBMS (ODBMS). 

INGENS data mining facilities support sophisticated end users in their topographic 
map interpretation tasks. In INGENS, each time a user wants to query its database on
some geographical objects not explicitly modelled, he/she can prospectively train the 
system to recognize such objects and to create a special user view. Training is based 
on a set of examples and counterexamples of geographical concepts of interest to the 
user (e.g., ravine or steep slopes). Such concepts are not explicitly modelled in the 
map legends, so they cannot be retrieved by simple queries. Furthermore, the user has 
serious difficulty formalizing their operational definitions. Therefore, it is necessary 
to rely on the support of a knowledge discovery system that generates some plausible 
“definitions”. The sophisticated user is simply asked to provide a set of (counter-) 
examples (e.g., map cells) and a number of parameters that define the data mining 
task more precisely. 

An INGENS user should not have problems, due to the integration of different 
technologies, such as data mining, OODBMS, and GIS. In general, to solve any such 
problems the use of data mining languages has been proposed, which interface users 
with the whole system and hide the different technologies [10]. However, the problem 
of designing a spatial mining language has received little attention in the literature. To 
our knowledge, the only spatial mining language is GMQL (Geo Mining Query 
Language) [16], which is based on DMQL (Data Mining Query Language) [8]. These 
languages have both been developed for mining knowledge from relational databases, 
so SQL remains the milestone on which their syntax and semantics are built. 

This paper presents SDMOQL (Spatial Data Mining Object Query Language), a
spatial mining query language for INGENS sophisticated users. Its main 
characteristics are the following: 

It is based on OQL, the standard defined by ODMG (Object Database 
Management Group) for designing object oriented models (www.odmg.org). 

It interfaces relational data mining systems [2] that work with first-order 
representations of input data and output patterns. 

It separates the logical representation of spatial objects from their physical or 
geometrical representation. 
The paper is organized as follows. INGENS architecture and conceptual database
schema are described in the next section, while in Section 3 the spatial data mining
process in INGENS is introduced. In Section 4 the syntax of SDMOQL is presented,
while in Section 5 a complete example of SDMOQL's use in INGENS is described.
Finally, related works are discussed in Section 6.



2 INGENS Architecture and Conceptual Database Schema 

The three-layered architecture of INGENS is illustrated in Fig. 1. The interface layer 
implements a Graphical User Interface (GUI), a Java applet which allows the system
to be accessed by the following four categories of users: 

Administrators, who are responsible for GIS management. 

Map maintenance users, whose main task is updating the Map Repository.

Sophisticated end users, who can ask the system to learn operational definitions
of geographical objects not explicitly modelled in the database.

Casual end users, who occasionally access the database and may need different 
information each time. Casual users cannot train INGENS. 

Only sophisticated end-users are allowed to discover new patterns by using 
SDMOQL. 

Fig. 1. INGENS three-layered software architecture

The application enablers layer makes several facilities available to the four
categories of INGENS users. In particular, the Map Descriptor is the application
enabler responsible for the automated generation of first-order logic descriptions of
some geographical objects. Descriptors generated by a Map Descriptor are called
operational. The Data Mining Server provides a suite of data mining systems that can
be run concurrently by multiple users to discover previously unknown, useful patterns
in geographical data. In particular, the Data Mining Server provides sophisticated
users with an inductive learning system, named ATRE [18], which can generate
models of geographical objects from a set of training examples and counter-examples.
The Query Interpreter allows any user to formulate a query in SDMOQL. The query
can refer to a specific map and can contain both predefined predicates and predicates
whose operational definition has already been learned. Therefore, it is the Query
Interpreter’s responsibility to select the objects involved from the Map Repository, to
ask the Map Descriptor to generate their logical descriptions and to invoke the
inference engine of the Deductive Database, in order to check conditions expressed by
both predefined and learned predicates. The Map Converter is a suite of tools which
supports the acquisition of maps from external sources. Currently, INGENS can
export maps in Drawing Interchange Format (DXF) by Autodesk Inc.
(www.autodesk.com) and can automatically acquire information from vectorized maps in the
MAP87 format, defined by the Italian Military Geographic Institute (IGMI)
(www.nettuno.it/fiera/igmi/igmit.htm). Since IGMI's maps contain static information on
orographic, hydrographic and administrative boundaries alone, a Map Editor is
required to integrate and/or modify this information.

The resource layer controls the access to both the Knowledge Repository and the 
Map Repository. The former contains the operational definitions of geographical 
objects induced by the Data Mining Server. In INGENS, different users can have
different definitions of the same geographical object. Knowledge is expressed 
according to a relational representation paradigm and managed by an XSB-based 
deductive relational DBMS [23]. The Map Repository is the database instance that 
contains the actual collection of maps stored in the GIS. Geographic data are 
organized according to an object-oriented data model. The object-oriented DBMS 
used to store data is a commercial one (ObjectStore 5.0 by Object Design, Inc.), so 
that full use is made of a well-developed, technologically mature non-spatial DBMS. 
Moreover, an object-oriented technology facilitates the extension of the DBMS to 
accommodate management of geographical objects. The Map Storage Subsystem is 
involved in storing, updating and retrieving items in and from the map collection. As 
a resource manager, it represents the only access path to the data contained in the 
Map Repository, which are accessed by multiple, concurrent clients. 

Each map is stored according to a hybrid tessellation-topological model. The 
tessellation model follows the usual topographic practice of superimposing a regular 
grid on a map, in order to simplify the localization process. Indeed, each map in the 
repository is divided into square cells of the same size. 
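
The localization step enabled by the tessellation model can be sketched as follows. This is only an illustration: the text does not state how INGENS numbers its cells, so the row-major, 1-based numbering and the parameter names (origin_x, origin_y, cell_size, cells_per_row) are assumptions.

```python
import math


def cell_number(x, y, origin_x, origin_y, cell_size, cells_per_row):
    """Return the 1-based, row-major number of the square cell containing (x, y).

    Assumption: cells are numbered row by row starting from the grid origin.
    """
    col = math.floor((x - origin_x) / cell_size)
    row = math.floor((y - origin_y) / cell_size)
    if col < 0 or row < 0 or col >= cells_per_row:
        raise ValueError("point lies outside the grid")
    return row * cells_per_row + col + 1


# A point 2.5 cells to the right and 1.5 cells up in a grid with 10 cells per row.
print(cell_number(2500, 1500, 0, 0, 1000, 10))
```

Because all cells have the same size, locating an object reduces to two divisions, which is the simplification the regular grid is meant to provide.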

In the topological model of each cell it is possible to distinguish two different 
structural hierarchies: physical and logical. The physical hierarchy describes the 
geographical objects by means of the most appropriate physical entity, that is: point, 
line or region. The logical hierarchy expresses the semantics of geographical objects, 
independent of their physical representation. In the Map Repository, the logical 
hierarchy is represented by eight distinct classes of the database schema, each of 
which corresponds to a geographic layer in a topographic map, namely hydrography, 
orography, land administration, vegetation, administrative (or political) boundary, 
ground transportation network, construction and built-up area. Objects of a layer are 
instances of more specific classes to which it is possible to associate a unique physical 
representation. For instance, the administrative boundary layer describes objects of 
one of the following subclasses: city, province, county or state. 

Finally, each geographical object in the map has both a physical structure and a 
logical structure. The former is concerned with the representation of the object on 
some media by means of a point, a line or a region. Therefore, the physical structure 
of a cell associates the content of a cell with the physical hierarchy. On the other 
hand, the logical structure of a cell associates the content with the logical hierarchy, 
such as river, city, and so on. The logical structure is related to the map legend. 



3 Spatial Data Mining Process in INGENS 

The spatial data mining process in INGENS (see Fig. 2) is aimed at a user who 
controls the parameters of the process. Initially, the query written in SDMOQL is 
syntactically and semantically analyzed. Then the Map Descriptor generates a highly 
conceptual qualitative representation of the raw data stored in the object-oriented 
database (see Fig. 3). This representation is a conjunctive formula in a first-order 
logic language, whose atoms have the following syntax: $f(t_1, \ldots, t_n) = \mathit{value}$, where $f$ is 
a function symbol called descriptor, the $t_i$ are terms and the value is taken from the range 
of $f$. A set of descriptors used in INGENS is reported in Table 1. They can be roughly 
classified as spatial and non-spatial. 

According to their nature, it is possible to classify spatial descriptors as follows: 

Geometrical, if they depend on the computation of some metric/distance. Their 
domain is typically numeric. Examples are line_shape and extension. 
Topological, if they are relations that are invariant under the topological 
transformations (translation, rotation, and scaling). The type of their domain is 
nominal. Examples are region_to_region and point_to_region. 

Directional, if they concern orientation. The type of their domain can be either 
numerical or nominal. An example is geographic_direction. 

Locational, if they concern the location of objects. Locations are represented by 
numeric values that express co-ordinates. There are no examples of locational 
descriptors in Table 1. 

Some spatial descriptors are hybrid, in the sense that they merge properties of two 
or more categories above. For instance, the descriptor line_to_line that expresses 
conditions of parallelism and perpendicularity is both topological (it is invariant with 
respect to translation, rotation and scaling) and geometrical (it is based on the angle of 
incidence). 

Fig. 2. Spatial data mining process in INGENS 

contain(c11,pc494_11)=true, contain(c11,ss296_11)=true, contain(c11,qu61_11)=true, 
type_of(pc494_11)=parcel, type_of(ss296_11)=street, type_of(qu61_11)=quote, 
subtype_of(pc494_11)=cultivation, subtype_of(ss296_11)=cart_road, 
color(pc494_11)=black, color(ss296_11)=black, color(qu61_11)=black, 
part_of(pc494_11,x1)=true, part_of(ss296_11,x68)=true, part_of(qu61_11,x69)=true, 
altitude(x8)=97.00, altitude(x69)=102.00, 
extension(x3)=101, extension(x68)=45.00, 
geographic_direction(x3)=north, geographic_direction(x68)=east, 
line_shape(x3)=straight, line_shape(x68)=straight, 
area(x1)=99962, ..., area(x66)=116662, 
density(x1)=medium, density(x66)=high, 
distance(x3,x68)=80.00, distance(x62,x68)=87.00, 
line_to_line(x3,x68)=almost_parallel, line_to_line(x62,x68)=almost_parallel, 
region_to_region(x1,x9)=disjoint, ..., region_to_region(x45,x66)=disjoint, 
line_to_region(x16,x1)=adjacent, ..., line_to_region(x67,x66)=intersect, 
point_to_region(x2,x1)=outside, ..., point_to_region(x69,x66)=outside 



Fig. 3. Raster and vector representation (above) and symbolic description of cell 11 (below). 
The cell is an example of a territory where there is a system of farms. The cell is extracted from 
a topographic chart (Canosa di Puglia 176 IV SW - Series M891) produced by the Italian 
Geographic Military Institute (IGMI) at scale 1:25,000 and stored in INGENS 

In INGENS geo-referenced objects can also be described by three non-spatial 
descriptors: color, type_of and subtype_of. Finally, the descriptor part_of 
associates the physical structure to a logical object. For instance, in the description: 

type_of(s1)=street, part_of(s1,x1)=true, part_of(s1,x2)=true 

the constant s1 denotes a street which is physically represented by two lines, which 
are referred to as constants x1 and x2. 
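
To make the role of part_of concrete, the description above can be held as a list of atoms and queried for the physical objects of a logical object. The tuple layout (descriptor, terms, value) and the helper name physical_parts are illustrative assumptions, not the internal representation used by INGENS.

```python
# Each atom f(t1, ..., tn) = value is stored as (descriptor, terms, value).
description = [
    ("type_of", ("s1",), "street"),
    ("part_of", ("s1", "x1"), True),
    ("part_of", ("s1", "x2"), True),
]


def physical_parts(atoms, logical_object):
    """Return the physical objects linked to a logical object by part_of."""
    return [
        terms[1]
        for name, terms, value in atoms
        if name == "part_of" and terms[0] == logical_object and value is True
    ]


print(physical_parts(description, "s1"))  # the two lines representing street s1
```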

The operational semantics of the descriptors is based on a set of methods defined in 
the object-oriented model of the map repository. More details on the computational 
methods for their extraction are reported in [17]. 

This qualitative data representation can be easily translated into Datalog with built- 
in predicates [18]. Thanks to this transformation, it is possible to use the output of the 
Map Descriptor module in many relational data mining algorithms, which return 
spatial patterns expressed in a first-order language. Finally, the results of the mining 
process are presented to the user. The graphical feedback is very important in the 
analysis of the results. 
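
One possible encoding of this translation is sketched below. The actual transformation is the one described in [18]; here it is merely assumed that an atom f(t1, ..., tn) = value becomes a Datalog fact whose last argument is the value. The function name to_datalog is hypothetical.

```python
def to_datalog(descriptor, terms, value):
    """Render the atom descriptor(terms) = value as a Datalog fact.

    Assumption: the value is appended as an extra, last argument.
    """

    def fmt(v):
        # Booleans are written as the lowercase constants used in the descriptions.
        if isinstance(v, bool):
            return "true" if v else "false"
        return str(v)

    args = [fmt(t) for t in terms] + [fmt(value)]
    return f"{descriptor}({', '.join(args)})."


print(to_datalog("type_of", ("s1",), "street"))
print(to_datalog("part_of", ("s1", "x1"), True))
```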

Table 1. Set of descriptors extracted by the map descriptor module in INGENS 

| Feature | Meaning | Type | Domain type | Domain values |
|---|---|---|---|---|
| contain(C,L) | Cell C contains a logical object L | Topological relation | Boolean | true, false |
| part_of(L,F) | Logical object L is composed by physical object F | Topological relation | Boolean | true, false |
| type_of(L) | Type of L | Non-spatial attribute | Nominal | 33 nominal values |
| subtype_of(L) | Specialization of the type of L | Non-spatial attribute | Nominal | 101 nominal values specializing the type_of domain |
| color(L) | Color of L | Non-spatial attribute | Nominal | blue, brown, black |
| area(F) | Area of F | Geometrical attribute | Linear | [0..MAX_AREA] |
| density(F) | Density of F | Geometrical attribute | Ordinal | Symbolic names chosen by an expert |
| extension(F) | Extension of F | Geometrical attribute | Linear | [0..MAX_EXT] |
| geographic_direction(F) | Geographic direction of F | Directional attribute | Nominal | north, east, north_west, north_east |
| line_shape(F) | Shape of the linear object F | Geometrical attribute | Nominal | straight, cuspidal, curvilinear |
| altitude(F) | Altitude of F | Geometrical attribute | Linear | [0..MAX_ALT] |
| line_to_line(F1,F2) | Spatial relation between two lines F1 and F2 | Hybrid relation | Nominal | almost_parallel, almost_perpendicular |
| distance(F1,F2) | | Geometrical relation | Linear | [0..MAX_DIST] |
| region_to_region(F1,F2) | Spatial relation between two regions F1 and F2 | Topological relation | Nominal | disjoint, meet, overlap, covers, contains, equal, covered_by, inside |
| line_to_region(F1,F2) | Spatial relation between a line F1 and a region F2 | Hybrid relation | Nominal | along_edge, intersect |
| point_to_region(F1,F2) | Spatial relation between a point F1 and a region F2 | Topological relation | Nominal | inside, outside, on_boundary, on_vertex |


4 Design of a Data Mining Language for INGENS 

SDMOQL is designed to support the interactive data mining process in INGENS. 
Designing a comprehensive data mining language is a challenging problem because 
data mining covers a wide spectrum of tasks, from data classification to mining 
association rules. The design of an effective data mining query language requires a 
deep understanding of the power, limitation and underlying mechanisms of the 
various kinds of data mining tasks. A data mining query language must incorporate a 
set of data mining primitives designed to facilitate efficient, fruitful knowledge 
discovery. Eight primitives have been considered as guidelines for the design of 
SDMOQL. They are: 

1. the set of objects relevant to a data mining task, 

2. the kind of knowledge to be mined, 

3. the set of descriptors to be extracted from a digital map, 

4. the set of descriptors to be used for pattern description, 

5. the background knowledge to be used in the discovery process, 

6. the concept hierarchies, 

7. the interestingness measures and thresholds for pattern evaluation, and 

8. the expected representation for visualizing the discovered patterns. 

These primitives correspond directly to as many non-terminal symbols of the 
definition of an SDMOQL statement, according to an extended BNF grammar. 
Indeed, the SDMOQL top-level syntax is the following: 

<SDMOQL> ::= <SDMOQL_Statement>; { <SDMOQL_Statement>; } 
<SDMOQL_Statement> ::= <Spatial_Data_Mining_Statement> 
    | <Background_Knowledge> 
    | <Hierarchy> 
    | <Result_Displaying> 
<Spatial_Data_Mining_Statement> ::= <Object_Specification_Query> 
    mine <Kind_of_Pattern> 
    analyze <Primitive_descriptors> 
    with descriptors <Pattern_descriptors> 
    [ <Background_Knowledge> ] 
    { <Hierarchy> } 
    [ with <Interestingness_Measures> ] 
    [ <Result_Displaying> ] 

where “[]” represents 0 or one occurrence and “{ }” represents 0 or more occurrences, 
and words in bold type represent keywords. In sections 4.2 to 4.8 the detailed syntax 
for each data mining primitive is both formally specified and explained through 
various examples of possible mining problems. 



4.1 Data Specification: General Principles 

The first step in defining a data mining task is the specification of the data on which 
mining is to be performed. Data mining query languages presented in the literature 
allow the user to specify, through an SQL query, a single data table, where each row 
represents one unit of analysis*, while each column corresponds to a variable (i.e. an 
attribute) of the unit of analysis. Generally, no interaction between units of analysis is 
assumed. 

The situation is more complex in spatial data mining. First, units of analysis are 
spatial objects, which means that they are characterized, among other things, by 
spatial properties. 

Second, attributes of some spatial objects in the neighborhood of, or contained in, a 
unit of analysis may affect the values taken by attributes of the unit of analysis. 
Therefore, we need to distinguish units of analysis, which are the reference objects of 
an analysis, from other task-relevant spatial objects, and we need to represent 
interactions between them. In this context, the single table representation supported 
by traditional data mining query languages is totally inadequate, since different 
geographical objects may have different properties, which can be properly modelled 
by as many data tables as the number of object types. 

Third, in traditional data mining relatively simple transformations are required to 
obtain units of analysis from the units of observation^ explicitly stored in the database. 
The unit of observation is often the same as the unit of analysis, in which case no 
transformation at all is required. On the contrary, in GIS research, the wealth of 
secondary data sources creates opportunities to conduct analyses with data from 
multiple units of observation. For instance, a major national study uses a form that 
collects information about each person in a dwelling and information about the 
housing structure, hence it collects data for two units of observation: persons and 
housing structures. From these data, different units of analysis may be constructed: 
household could be examined as a unit of analysis by combining data from people 
living in the same dwelling or family could be treated as the unit of analysis by 
combining data from all members in a dwelling sharing a familial relationship. Units 
of analysis can be constructed from units of observation by making explicit the spatial 
relations such as topological, distance and direction relations, which are implicitly 
defined by the location and the extension of spatial objects. 
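
The construction of different units of analysis from the same units of observation can be illustrated with the household and family example. The records and field names below are invented for illustration only; they are not data from the study mentioned in the text.

```python
from collections import defaultdict

# Units of observation: persons, each recorded with dwelling and family identifiers.
persons = [
    {"name": "p1", "dwelling": "d1", "family": "f1"},
    {"name": "p2", "dwelling": "d1", "family": "f1"},
    {"name": "p3", "dwelling": "d1", "family": "f2"},
    {"name": "p4", "dwelling": "d2", "family": "f3"},
]


def group_units(records, keys):
    """Build units of analysis by grouping observations on the given keys."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in keys)].append(record["name"])
    return dict(groups)


# Household: all people living in the same dwelling.
households = group_units(persons, ["dwelling"])
# Family: members of a dwelling sharing a familial relationship.
families = group_units(persons, ["dwelling", "family"])
print(households)
print(families)
```

In the spatial case, the grouping key is replaced by a spatial relation (topological, distance or direction) that has first to be made explicit, which is why the transformation is more complex than in traditional data mining.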

Fourth, working at the level of stored data, that is, geometrical representations 
(points, lines and regions) of geographical objects, is often undesirable. The GIS user 
is interested in working at higher conceptual levels, where human-interpretable 
properties and relations between geographical objects are expressed. A typical 
example is represented by the possible relations between two roads, which either 
cross each other, or run parallel, or can be confluent, independently of the fact that 
they are physically represented as “lines” or “regions” in a map. 

To solve these problems, in SDMOQL the specification of the geographical 
objects (both reference and task-relevant) of interest for the data mining task (first 
primitive) is separated from the description of the units of analysis (third and fourth 
primitives). Each unit of analysis is described by means of both (non-)spatial 
properties and spatial relations between selected objects. First-order logic is adopted 
as representation language, since it overcomes the limitations of the single table 
representation. Some basic descriptors (see Table 1) are generated by the Map 
Descriptor, to support complex transformations of the data stored in the database into 
descriptions of units of analysis. Their specification is given in the third primitive. 
However, these descriptors refer to the physical representation of geographical objects 
of interest. To produce high-level conceptual descriptions involving objects of the 
logical hierarchy, the user can specify a set of new descriptors on the basis of those 
extracted by the Map Descriptor. New descriptors are specified in the background 
knowledge (fifth primitive) by means of logic programs. Moreover, it is possible to 
specify that final patterns should be described by means of these new descriptors in 
the fourth primitive. 

* In statistics, the unit of analysis is the basic entity or object about which generalizations are to 
be made based on an analysis and for which data are collected in the form of variables. 

^ The unit of observation is the entity in primary research that is observed and about which 
information is systematically collected. 



4.2 Task-Relevant Object Specification 

The selection of geographical objects is performed by means of simplified OQL 
queries with a SELECT-FROM-WHERE structure, namely: 

<Data_Specification_Query> ::= <Query_Statement> 
    { UNION <Query_Statement> } 
<Query_Statement> ::= SELECT <Object> {, <Object>} 
    FROM <Class> {, <Class>} 
    [WHERE <Conditions>] 

The SELECT clause should return objects of a class in the database schema 
corresponding to a cell, a layer or a type of logical object. Therefore, the selection of 
object properties such as the attribute river_name of a river, is not permitted. 
Moreover, the selected objects must belong to the same symbolic level (cell, layer or 
logic object). More formally the FROM clause can contain either a group of Cells or a 
set of Layers, or a set of Logic Objects, but never a mixture of them. Whenever the 
generation of the descriptions of objects belonging to different symbolic levels is 
necessary, the user can obtain it by means of the UNION operator. The following are 
examples of valid data queries: 

Example 1: Cell-level query. The user selects cell 26 from the topographic map of 
Canosa (Apulia) and the Map Descriptor generates the description of all the objects in 
this cell. 

SELECT x 
FROM x in Cell 
WHERE x->num_cell = 26 AND x->part_map->map_name = "Canosa" 

Example 2: Layer-level query. The user selects the Orography layer from the 
topographic map of Canosa and the Construction layer from any map. The Map 
Descriptor generates the description of the objects in these layers. 

SELECT x, y 
FROM x in Orography, y in Construction 
WHERE x->part_map->map_name = "Canosa" 

Example 3: Object-level query. The user selects the objects of the logic class River 
and the objects of type motorway (instances of the class Road), from cell 26 of the 
topographic map of Canosa. The Map Descriptor generates the description of these 
objects. 

SELECT x, y 
FROM x in River, y in Road 
WHERE x->part_map->map_name = "Canosa" 
    AND y->part_map->map_name = "Canosa" 
    AND x->log_incell->num_cell = 26 AND y->log_incell->num_cell = 26 
    AND y->type_road = "motorway" 

The above queries do not present semantic problems. However, the next example is 
an OQL query which is syntactically correct but selects data that cannot be a valid 
input to the Map Descriptor. 

Example 4: Semantically ambiguous query. 

SELECT x, y 
FROM x in Cell, y in River 
WHERE x->num_cell = 26 AND y->log_incell->num_cell = 26 

This query selects the object cell 26 and all rivers in it. However, it is unclear 
whether the Map Descriptor should describe the entire cell 26 or only the rivers in it, 
or both. In the first case, a cell-level query must be formulated (see example 1). In the 
second case, an object-level query produces the desired results (see example 3). In the 
(unusual) case that both kinds of descriptions have to be generated, the problem can 
be solved by the UNION operator, applied to the cell-level query and the object-level 
query. Therefore, the following constraint is imposed on SDMOQL: the selected data 
must belong to the same symbolic level (cell, layer or logic object). More formally, the 
FROM clause can contain either a group of Cells or a set of Layers, or a set of Logic 
Objects, but never a mixture of them. 
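
The same-level constraint can be pictured as a simple check on the classes named in the FROM clause. The sketch below is not the Query Interpreter's implementation: the dictionary LEVEL_OF_CLASS lists only the classes that appear in the examples above, and the function name symbolic_level is hypothetical.

```python
# Symbolic level of the schema classes used in Examples 1-4 (partial, illustrative).
LEVEL_OF_CLASS = {
    "Cell": "cell",
    "Orography": "layer",
    "Construction": "layer",
    "River": "logical object",
    "Road": "logical object",
}


def symbolic_level(from_classes):
    """Return the single symbolic level of a FROM clause, or raise if levels are mixed."""
    levels = {LEVEL_OF_CLASS[c] for c in from_classes}
    if len(levels) != 1:
        raise ValueError(f"FROM clause mixes symbolic levels: {sorted(levels)}")
    return levels.pop()


print(symbolic_level(["River", "Road"]))  # accepted, as in Example 3
try:
    symbolic_level(["Cell", "River"])  # rejected, as in Example 4
except ValueError as err:
    print(err)
```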

The next example is useful to present the constraints imposed on the SELECT 
clause. 

Example 5: Attributes in the SELECT clause. 

SELECT x.name_river 
FROM x in River 

The query selects the names of all the rivers stored in the database. The result set 
contains attributes and not geographic objects to be described by a set of attributes 
and relations. In order to select proper input data for the Map Descriptor, the SELECT 
clause should return objects of a class in the database schema corresponding to a cell, 
a layer or a type of logical object. It might be observed that the presence of an 
attribute in the SELECT clause can be justified when its type corresponds to a class. 
For instance, the following query: 

SELECT x->River 
FROM x in Cell 
WHERE x->num_cell = 26 

concerns all rivers in cell 26. Nevertheless, thanks to inverse relations (inverse 
members) characterizing an object model, it is possible to reformulate it as follows: 

SELECT x 
FROM x in River 
WHERE x->log_incell->num_cell = 26 

In this way, all the above constraints should be respected. 





4.3 The Kind of Knowledge to be Mined 

The kind of knowledge to be mined determines the data mining task in hand. For 
instance, classification rules or decision trees are used in classification tasks, while 
association rules or complex correlation coefficients are extracted in association tasks. 
Currently, SDMOQL supports the generation of either classification rules^ or 
association rules, which means that only two different mining problems can be solved 
in INGENS: the former has a predictive nature, while the latter is descriptive. The 
top-level syntax is defined below: 

<Kind_of_Pattern> ::= <Classification_Rules> | <Association_Rules> 

The non-terminal <Classification_Rules> specifies that patterns to be mined 
concern a classification task: 

<Classification_Rules> ::= classification as <Pattern_Name> 
    for <Classification_Concept> {, <Classification_Concept> } 

In a classification task, the user may be interested in inducing a set of classification 
rules for a subset of the classes (or concepts) to which training examples belong. 
Typically, the user specifies both “positive” and “negative” examples, that is, he/she 
specifies examples of two different classes, but he/she is interested in classification 
rules for the “positive” class alone. In this case, the subset of interest for the user is 
specified in the <Classification_Concept> list. 

In SDMOQL, spatial association rule mining tasks are specified as follows: 

<Association_Rules> ::= association as <Pattern_Name> 

key is <Descriptor> 

As pointed out, spatial association rules define spatial patterns involving both 
reference objects and task-relevant objects [21]. For instance, a user may be 
interested in describing a given area by finding associations between large towns 
(reference objects) and spatial objects in the road network, hydrography, and 
administrative boundary layers (task-relevant objects). The atom denoting the 
reference objects is called key atom. The predicate name of the key atom is specified 
in the key is clause. 

4.4 Specification of Primitive and Pattern Descriptors 

The analyze clause specifies what descriptors, among those automatically generated 
by the Map Descriptor, can be used to describe the geographical objects extracted by 
means of the first primitive. The syntax of the analyze clause is the following: 

analyze <Primitive_descriptors> 

where: 

<Primitive_descriptors> ::= <Descriptor> {, <Descriptor>} 
    parameters <Parameter_specs> {, <Parameter_specs>} 
<Descriptor> ::= <Predicate>/<Arity> 
<Parameter_specs> ::= <Parameter_name> threshold <Integer> 



^ Here, the term classification rule denotes the result of a supervised discrimination process. 
On the contrary, Han & Kamber [9,10] use the same term to denote the result of an 
unsupervised clustering process. 

The specification of a set of parameters is required by the Map Descriptor to 
automatically generate some primitive descriptors. 

The language used to describe generated patterns is specified by means of the 
following clause: 

with descriptors <Pattern_descriptors> 

where: 

<Pattern_descriptors> ::= <Descriptor_specification> 
    {; <Descriptor_specification> } 
<Descriptor_specification> ::= <Descriptor> [cost <Integer>] | 
    <Descriptor> [with <Terms_Spec>] 
<Terms_Spec> ::= <Term_Spec> {, <Term_Spec>} 
<Term_Spec> ::= <Constant_Type> | <Variable_Type> 
<Constant_Type> ::= constant [<Value>] 
<Variable_Type> ::= variable mode <Variable_Mode> role <Variable_Role> 
<Variable_Mode> ::= old | new | diff 
<Variable_Role> ::= ro | tro 

The specification of descriptors to be used in the high-level conceptual descriptions 
can be of two types: either the name of the descriptor and its relative cost, or the name 
of the descriptor and the full specification of its arguments. The former is appropriate 
for classification tasks, while the latter is required by association rule mining tasks. 

An example of a classification task activated by an SDMOQL statement is 
reported in Fig. 4. In this case, the Map Descriptor generates a symbolic description 
of some cells by using the predicates listed in the analyze clause. There are four 
concepts to be learned, namely class(_)=system_of_farms, class(_)=fluvial_landscape, 
class(_)=royal_cattle_track, and class(_)=system_of_cliffs. Here the 
function symbol class is unary and “_” denotes the anonymous variable à la Prolog. 
The user can provide examples of these four classes, as well as of other classes. 
Examples of systems of farms are considered to be positive for the first concept in 
the list and negative for the others. The converse is true for examples of fluvial 
landscapes, royal cattle tracks and systems of cliffs. Examples of other classes are 
considered to be counterexamples of all classes for which rules will be generated. The 
only requirement for the INGENS user is the ability to detect and mark some cells 
that are instances of a class. Indeed, the INGENS GUI allows the user both to formulate 
and run an SDMOQL query and to associate the description of each cell with a class. 
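
The way marked cells act as positive or negative examples can be sketched as follows. The labelling logic follows the paragraph above; the assignment of cell 8 to fluvial_landscape is a hypothetical value chosen for illustration, and the function name label_examples is not part of INGENS.

```python
def label_examples(cell_classes, concepts):
    """For each concept, split the marked cells into positive and negative examples."""
    labels = {}
    for concept in concepts:
        labels[concept] = {
            "positive": sorted(c for c, k in cell_classes.items() if k == concept),
            "negative": sorted(c for c, k in cell_classes.items() if k != concept),
        }
    return labels


# Cells marked by the user; "other" is a class for which no rules are requested.
cell_classes = {
    5: "system_of_farms",
    11: "system_of_farms",
    8: "fluvial_landscape",  # hypothetical assignment
    15: "other",
}
labels = label_examples(cell_classes, ["system_of_farms", "fluvial_landscape"])
print(labels["system_of_farms"])
print(labels["fluvial_landscape"])
```

Cells of the class "other" never appear as positive examples, so they serve as counterexamples for every concept in the list.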

Rules generated for the four concepts are expressed by means of descriptors 
specified in the with descriptors list. They are specified by Prolog programs on the 
basis of descriptors generated by the Map Descriptor. For instance, the descriptor 
font_to_parcel/2 has two arguments which denote two logical objects, a font and a 
parcel. The topological relation between the two logical objects is defined by means 
of the clause: 

font_to_parcel(Font, Parcel) = Topographic_Relation :- 
    type_of(Font) = font, part_of(Font, Point) = true, 
    type_of(Parcel) = parcel, part_of(Parcel, Region) = true, 
    point_to_region(Point, Region) = Topographic_Relation. 
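
The meaning of this clause can be mimicked procedurally: the relation between two logical objects is read off the point_to_region relation between their physical parts. The sketch below is only an illustration of that lifting step; the fact base and the identifiers f1, pc1, x1 and x2 are hypothetical, and INGENS evaluates such clauses in its deductive database rather than in this way.

```python
def font_to_parcel(facts, font, parcel):
    """Return the point_to_region values that relate a font to a parcel."""
    if facts["type_of"].get(font) != "font":
        return []
    if facts["type_of"].get(parcel) != "parcel":
        return []
    # Physical parts of the two logical objects, obtained through part_of.
    points = [p for (obj, p) in facts["part_of"] if obj == font]
    regions = [r for (obj, r) in facts["part_of"] if obj == parcel]
    return [
        facts["point_to_region"][(p, r)]
        for p in points
        for r in regions
        if (p, r) in facts["point_to_region"]
    ]


# Hypothetical facts: font f1 is drawn as point x2, parcel pc1 as region x1.
facts = {
    "type_of": {"f1": "font", "pc1": "parcel"},
    "part_of": [("f1", "x2"), ("pc1", "x1")],
    "point_to_region": {("x2", "x1"): "outside"},
}
print(font_to_parcel(facts, "f1", "pc1"))
```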

SELECT x FROM x in Cell WHERE x->num_cell = 5 
UNION SELECT x FROM x in Cell WHERE x->num_cell = 8 
UNION SELECT x FROM x in Cell WHERE x->num_cell = 11 
UNION SELECT x FROM x in Cell WHERE x->num_cell = 15 
UNION SELECT x FROM x in Cell WHERE x->num_cell = 16 
UNION SELECT x FROM x in Cell WHERE x->num_cell = 17 
UNION SELECT x FROM x in Cell WHERE x->num_cell = 27 
UNION SELECT x FROM x in Cell WHERE x->num_cell = 28 
UNION SELECT x FROM x in Cell WHERE x->num_cell = 34 
UNION SELECT x FROM x in Cell WHERE x->num_cell = 83 
UNION SELECT x FROM x in Cell WHERE x->num_cell = 84 
UNION SELECT x FROM x in Cell WHERE x->num_cell = 89 
mine classification as MorphologicalElements 
    for class(_)=system_of_farms, class(_)=fluvial_landscape, class(_)=royal_cattle_track, 
        class(_)=system_of_cliffs 
analyze contain/2, part_of/2, type_of/1, subtype_of/1, color/1, ... , 
        line_to_line/2, distance/2, line_to_region/2, ..., point_to_region/2 
parameters 
    maxValuePointRegionClose threshold 300, 
    minValueLineLong threshold 100, 
    maxValueLinesClose threshold 300, 
    minValueRegionLarge threshold 5000, 
    maxValueRegionClose threshold 500 
with descriptors contain/2 cost 1; class/1 cost 0; subtype_of/1 cost 0; 
    parcel_to_parcel/2 cost 0; slope_to_slope/2 cost 0; ... 
    canal_to_parcel/2 cost 0; ... ; font_to_parcel/2 cost 0; ... 
define knowledge 
    font_to_parcel(Font,Parcel) = Topographic_Relation :- 
        type_of(Font) = font, part_of(Font,Point) = true, 
        type_of(Parcel) = parcel, part_of(Parcel,Region) = true, 
        point_to_region(Point,Region) = Topographic_Relation. 
criteria 
    intermediate minimize negative_example_covered with tolerance 0.6, 
    intermediate maximize positive_example_covered with tolerance 0.4, 
    intermediate maximize selectors_of_clause with tolerance 0.3, 
    intermediate minimize cost with tolerance 0.4 
    final maximize positive_example_covered with tolerance 0.0, 
    final maximize selectors_of_clause with tolerance 0.0, 
    final minimize cost with tolerance 0.0 
    maxstar threshold 25, 
    consistent threshold 500, 
    max_ps threshold 11 
    recursion = off, 
    verbosity = { off, off, on, on, on } 

Fig. 4. A complete SDMOQL query for a classification task. In this case, the INGENS user 
marks cells 5, 11, 34, 42 as instances of the class “system of farms” and the cells 1, 2, 3, 4, 15, 16, 17 
as instances of the class “other” 

In association rule mining tasks, the specification of pattern descriptors corresponds 
to the specification of a collection of atoms predicateName($t_1, \ldots, t_n$), where the name 
of the predicate corresponds to a <Descriptor>, while <Term_Spec> describes each 
term $t_i$, which can be either a constant or a variable. When the term is a variable, the 
mode and role clauses indicate respectively the type of variable to add to the atom and 
its role in a unification process. Three different modes are possible: old, when the 
introduced variable can be unified with an existing variable in the pattern; new, when 
it is not already present in the pattern; and diff, when it is a new variable but its values are 
different from the values of a similar variable in the same pattern. Furthermore, the 
variable can fill the role of reference object (ro) or task-relevant object (tro) in a 
discovered pattern during the unification process. The is key clause specifies the atom 
which has the key role during the discovery process. The first term of the key object 
must be a variable with mode new and role ro. The following is an example of 
specification of pattern descriptors defined by an SDMOQL statement: 

with descriptors 

contain/2 with variable mode old role ro, variable mode new role tro; 
type_of/2 with variable mode diff role tro, constant; 

is key with variable mode new role ro, constant cultivation; 

This specification helps to select only association rules where the descriptors 
contain/2 and type_of/2 occur. The first argument of a type_of is always a diff variable 
denoting a spatial object, and it can play the roles of both ro and tro, whereas the 
second argument, i.e. the type of object, is the constant ‘cultivation’, if the first 
argument is a reference object, otherwise it is any other constant. The predicate 
contain links the ro of type cultivation with other spatial objects contained in the 
cultivation. The following association rule: 

type_of(X,cultivation), contain(X,Y), type_of(Y,olive_tree), X≠Y → contain(X,Z), 
type_of(Z,almond_tree), X≠Z, Y≠X 

satisfies the constraints of the specification and expresses the co-presence of both 
almond trees and olive-trees in some extensive cultivations. 

4.5 Syntax for Background Knowledge and Concept Hierarchy Specification 

Many data mining algorithms use background knowledge or concept hierarchies to 
discover interesting patterns. Background knowledge is provided by a domain expert 
on the domain to be mined. It can be useful in the discovery process. The SDMOQL 
syntax for background knowledge specification is the following: 

<Background_Knowledge> ::= [<New_Knowledge>] {<Use_Knowledge>} 
<New_Knowledge> ::= define knowledge <Clause> {, <Clause>} 
<Use_Knowledge> ::= use background knowledge of users <User> {, <User>} 
    on <Descriptor> {, <Descriptor>} 

In INGENS, the user can define a new background knowledge expressed as a set of 
definite clauses; alternatively, he/she can specify a set of rules explicitly stored in a 
deductive database and possibly mined in a previous step. The following is an 
example of a background knowledge specification: 

Example 7: Definition of close_to and import of the definition of ravine.
close_to(X,Y)=true :- region_to_region(X,Y)=meet.
close_to(X,Y)=true :- close_to(Y,X)=true.
use background knowledge of users UserName1 on ravine/1

Concept hierarchies allow knowledge mining at multiple abstraction levels. In
order to accommodate the different viewpoints of users regarding the data, there may 
be more than one concept hierarchy per attribute or dimension. For instance, some 
users may prefer to organize census districts by wards and districts, while others may 
prefer to organize them according to their main purpose (industrial area, residential 
area, and so on). There are four major types of concept hierarchies [8]: 

Schema hierarchies, which define total or partial orders among attributes in the 
database schema. 

Set-grouping hierarchies, which organize values for given attributes or 
dimensions into groups of constants or range values. 

Operation-derived hierarchies, which are based on operations specified by 
experts or data mining systems. 

Rule-based hierarchies, which occur when either a whole concept or a portion 
of it is defined by a set of rules. 

In SDMOQL a specific syntax is defined for the first two types of hierarchies: 
<Hierarchy> ::= [<New_Hierarchy>] [<Use_Hierarchy>]
<New_Hierarchy> ::= define hierarchy <Schema_Hierarchy>
| define hierarchy for <Set_Grouping_Hierarchy>
<Use_Hierarchy> ::= use hierarchy <Name_Hierarchy> of user <User>

The following example shows how to define some hierarchies in SDMOQL. 

Example 8: A definition of a schema hierarchy for some activity-related attributes 
and a set-grouping hierarchy for the descriptor distance. 

define hierarchy Activity as
level1:{business_activity, other_activity} < level0: Activity;
level2:{low_business_activity, high_business_activity} < level1: business_activity;
define hierarchy Distance for distance/2 as
level1:{far, near} < level0: Distance;
level2:{0, 1999} < level1: near;
level2:{2000, +inf} < level1: far;

The activity hierarchy can be used to mine multi-level spatial association rules [21]. 
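
To illustrate how a set-grouping hierarchy such as Distance could be applied, the following is a minimal Python sketch, not part of SDMOQL or INGENS. The dictionary layout and the function name `generalize_distance` are assumptions introduced only for this illustration; the ranges are those listed in Example 8.

```python
import math

# Hypothetical encoding of the Distance hierarchy from Example 8:
# each level-1 concept is paired with the level-2 range listed under it.
DISTANCE_HIERARCHY = {
    "near": (0, 1999),
    "far": (2000, math.inf),
}

def generalize_distance(value, level):
    """Map a raw distance to its concept at the requested hierarchy level."""
    for concept, (low, high) in DISTANCE_HIERARCHY.items():
        if low <= value <= high:
            if level == 0:
                return "Distance"      # root of the hierarchy
            if level == 1:
                return concept         # near / far
            return (low, high)         # level 2: the range itself
    raise ValueError(f"distance {value} is not covered by the hierarchy")

print(generalize_distance(350, 1))
print(generalize_distance(2500, 1))
```

Replacing raw distances by their level-1 concepts before mining is one way in which rules at a coarser abstraction level could be obtained.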



4.6 Syntax for Interestingness Measure Specification 

The user can control the data mining process by specifying interestingness measures 
for data patterns and their corresponding thresholds. The SDMOQL syntax is the 
following: 

<Interestingness_Measures> ::= [<Criteria>] [<Thresholds>] [<Settings>]
<Criteria> ::= criteria (intermediate | final) (minimize | maximize)
<Parameter> with tolerance <Value> {, (intermediate | final)
(minimize | maximize) <Parameter> with tolerance <Value>}
<Thresholds> ::= <Parameter> threshold <Threshold_Value>
{, <Parameter> threshold <Threshold_Value>}
<Settings> ::= <Parameter> = <String_Value>

Interestingness measures may include: threshold values, weights, search biases in
the hypotheses space and algorithm-specific parameters. In particular the user can
bias the search in the hypotheses space by a number of preference criteria, such as the 
maximization of the number of covered examples or the minimization of the number 
of variables in the body of a learned clause. He/she can also set thresholds such as 
confidence, support or number of learned concepts. Finally, the user can set the value 
of a generic input parameter of a data mining algorithm. 



4.7 Syntax for Visualization 

Data mining results should be displayed using rule visualization tools or some 
different output forms. SDMOQL provides the following primitives for displaying 
results in different forms: 

<Result_Displaying> ::= display as <Form>
[at level <Int_Value> for <Hierarchy_Name>],

where <Form> describes the output form, for example, if-then rules or tree. 
Moreover, if a hierarchy is available, mined results can be represented at different 
concept levels. This is particularly true in the case of multiple-level association rules. 



5 Mining Classification Rules for Topographic Map 
Interpretation 



In the previous section, the syntax of SDMOQL has been defined. Here we present a 
data problem concerning the generation of classification rules for topographic map 
interpretation. Let us suppose that a GIS user needs to locate a “sistema poderale” 
(system of farms) in the large territory of his/her interest. This geographical object is 
not present in the GIS model, thus, only the specification of its operational definition 
will allow the GIS to find cells containing a system of farms in a vectorized map. 
Who can provide it? The user is not able to do so for a number of reasons. 

Firstly, providing the GIS with operational definitions of some environmental 
concepts is not a trivial task. Often only declarative and abstract definitions are 
available, which are difficult to compile into database queries. 

Secondly, the operational definitions of some geographical objects are strongly 
dependent on the data model that is adopted by the GIS. Finding relationships 
between density of vegetation and climate is easier with a raster data model, while 
determining the usual orientation of some morphological elements is simpler in a 
topological data model. 

Thirdly, different applications of a GIS require the recognition of different 
geographical elements in a map. Providing the system in advance with all the 
knowledge required for its various application domains is simply illusory, especially 
in the case of wide-ranging projects such as those set up by governmental agencies. 

A solution to these problems can be found in the application of data mining 
techniques. For instance, an INGENS user can train the system to recognize cells with 
systems of farms, by performing the SDMOQL query in Fig. 4. The interpreter 
analyzes the query and verifies its syntactic and semantic correctness. Then the Map 
Descriptor generates a symbolic description for each specified cell (see Fig. 3) and
the expert associates each symbolic description with a concept, in order to define the
training set. Association is made by binding variable terms of one of the four concepts
to be learned to the constant terms in the descriptions of map cells. This step is
necessary to create the training set of positive and negative examples for the learning
system ATRE [18], which is used in INGENS for classification tasks. The user marks
the cells 5, 11, 34 as instances of the class “system of farms”, the cells 8, 16, 17 as
instances of the class “fluvial landscape”, the cells 15, 27, 28 as instances of the class
“royal cattle track” and the cells 83, 84, 89 as instances of the class “system of cliffs”. This
binding function is supported by INGENS GUI. The training set obtained is input to 
ATRE, which returns the classification rules. With reference to the above query, 
ATRE generates the following clauses: 

class(X1) = system_of_farms :-
contain(X1,X2) = true,
area_parcel(X2) in [102.787..249.525],
density_parcel(X2) = high,
font_to_parcel(X3,X2) = outside.

“A cell is an example of a system of farms if it contains a parcel (X2) that has an 
area between 102,787 and 249,525 square meters and a high vegetation density, and a 
font (X3) that is outside the parcel.” 

class(X1) = fluvial_landscape :-
contain(X1,X2) = true,
extension_road(X3) in [234.0..440.0],
canal_to_road(X2,X3) = almost_parallel,
distance_canal_to_road(X2,X3) in [42.0..300.0].

“A cell is an example of a fluvial landscape if it contains a canal (X2) and a street
(X3). The street has an extension between 234.0 and 440.0 meters and is almost 
parallel to the canal. In particular, the distance between the canal and the street is 
between 42.0 and 300.0 meters.” 

class(X1) = royal_cattle_track :-
contain(X1,X2) = true,
extension_road(X2) in [1002.0..1162.0],
subtype_of(X2) = main_road.

“A cell is an example of a royal cattle track if it contains a street (X2) that is a 
main road and has an extension between 1002.0 and 1162.0 meters.”

class(X1) = system_of_cliffs :-
contain(X1,X2) = true,
distance_contour_slope_to_contour_slope(X2,X3) in [2.0..74.0],
extension_contour_slope(X2) in [79.0..307.0].

“A cell is an example of a system of cliffs if it contains two contour slopes (X2,
X3), such that the distance between them is between 2.0 and 74.0 meters. One
contour slope (X2) has an extension between 79.0 and 307.0 meters.”
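
To make the reading of such a clause concrete, here is a small Python sketch that tests the fluvial landscape rule against one cell description. It is an illustration only: INGENS proves such goals with a deductive query engine, and the dictionary layout of `cell` and the function name are assumptions made for this example.

```python
def is_fluvial_landscape(cell):
    """Check the body of the fluvial_landscape clause for one cell.

    `cell` is a hypothetical description: a list of canals, a mapping from
    roads to their extension, and relations keyed by (canal, road) pairs.
    The clause variables X2 (canal) and X3 (road) are existential, so the
    rule fires as soon as one pair satisfies every condition.
    """
    for canal in cell["canals"]:
        for road, extension in cell["roads"].items():
            relation = cell["canal_to_road"].get((canal, road))
            distance = cell["distance_canal_to_road"].get((canal, road))
            if (234.0 <= extension <= 440.0
                    and relation == "almost_parallel"
                    and distance is not None
                    and 42.0 <= distance <= 300.0):
                return True
    return False

example_cell = {
    "canals": ["c1"],
    "roads": {"r1": 300.0},
    "canal_to_road": {("c1", "r1"): "almost_parallel"},
    "distance_canal_to_road": {("c1", "r1"): 100.0},
}
print(is_fluvial_landscape(example_cell))
```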

Whether the induced theory is “correct”, that is, whether it classifies correctly all
other examples of map cells not in the training set is beyond the scope of this work. 
However, it is noteworthy that these rules are coherent with the definitions given by 
town planners for the four morphological concepts of interest [19]. 

Operational definitions like those reported above can be used either to retrieve new 
instances of the learned concepts from the Map Repository or to facilitate the 
formulation of a query involving geographical objects not present in map legends. For 
instance, by submitting the following query: 

SELECT C 

FROM M in Map, C in Cell, R in Road 

WHERE M->name = "Canosa" AND C->map = M AND R->log_incell = C
AND R->type_road = "main_road" AND class(C) = fluvial_landscape

the user asks INGENS to find all cells in the Canosa map that are classified as fluvial 
landscape and contain a main road. To check the condition defined by the predicate 
class(C)=fluvial_landscape, the Query Interpreter generates the symbolic description 
of each cell in the map and asks the Query Engine of the Deductive Database to prove 
the goal class(C)=fluvial_landscape given the logic program above. 



6 Related Work 

Several data mining query languages have been proposed in the literature. MSQL is a 
rule query language proposed by Imielinski and Virmani [12] for relational databases. 
It satisfies the closure property, that is, the result of a query is a relation that can be 
queried further. Moreover, a cross-over between data and rules is supported, which 
means that there are primitives in the language that can map generated rules back to 
the source data, and vice versa. The combined result of these two properties is that a 
data mining query can be nested within a regular relational query. SDMOQL does not
allow users to formulate nested queries; however, as pointed out at the end of the
previous section, it supports some form of cross-over between data and mined rules. 
This is obtained by integrating deductive inferences for extracted rules with data 
selection queries expressed in OQL. 

Another data mining query language for relational databases is DMQL [8]. Its 
design is based on five primitives, namely the set of data relevant to a data mining 
task, the kind of knowledge to be mined, the background knowledge to be used in the 
discovery process, the concept hierarchies, the interestingness measures and 
thresholds for pattern evaluation. As explained in Section 4.1, the design of 
SDMOQL is based on a different set of principles. In particular, the specification of 
data relevant to a data mining task involves a separate specification for the 
geographical objects of interest for the application, for the set of automatically 
generated (primitive) descriptors and for the set of descriptors used to specify the 
patterns. An additional design principle is that of visualization, since in spatial data 
mining it is important to specify whether results have to be visualized or presented in 
a textual form. 

GMQL is based on DMQL and allows the user to specify the set of relevant data 
for the mining process, the type of knowledge to be discovered, the thresholds to filter 
out interesting rules, and the concept hierarchies as the background knowledge [16]. 
In the process of selecting data relevant to the mining task, the user has to specify (1)
the relevant tables, (2) the conditions that are satisfied by the relevant objects and (3) 
the properties of the objects which the mining process is based on. Conditions may 
involve spatial predicates on topological relations, distance relations and direction 
relations. Although data can be selected from several tables, mining is performed only
on a single table which results from an SQL query (single table assumption). GMQL
queries can generate different types of knowledge, namely characteristic rules, 
comparison rules, clustering rules and classification rules. In Koperski’s thesis, an 
extension to association rules was also proposed but not implemented. Unlike
GMQL, SDMOQL separates the physical representation of geographical objects
from their logical meaning. Moreover, all observations reported above for DMQL
apply to GMQL as well.

Finally, it is noteworthy that an object-oriented extension of DMQL, named
ODMQL, has also been proposed [4]. The design of ODMQL is based on the same 
primitives used for DMQL, so the main innovation is that each primitive is in an 
OQL-like syntax. Path expressions are supported in ODMQL, while more advanced 
features of object-oriented query languages, such as the use of collections and 
methods, are not mentioned. An interesting aspect of ODMQL, which will be taken 
into account in further developments of SDMOQL, is that some concept hierarchies 
are automatically defined by the inheritance hierarchy of classes. 



7 Conclusions 

In this paper, a spatial data mining language for a prototypical GIS with knowledge 
discovery facilities has been partially presented. This language is based on a 
simplified OQL syntax and is defined in terms of eight data mining primitives. For
a given query, these primitives define the set of objects relevant to a data mining task, 
the kind of knowledge to be mined, the set of descriptors to be extracted from a digital 
map, the set of descriptors to be used for pattern description, the background 
knowledge to be used in the discovery process, the concept hierarchies, the 
interestingness measures and thresholds for pattern evaluation, and the expected 
representation for visualizing the discovered patterns. An interpreter of this language 
has been developed in the system INGENS. It interfaces a Map Descriptor module 
that can generate a first-order logic description of selected geographical objects. A 
full example of the query formulation and its results has been reported for a 
classification task used in the qualitative interpretation of topographic maps. An 
extension of this language to other spatial data mining tasks supporting quantitative 
interpretation of maps is planned for the near future. 

Abstract. An inductive query specifies a set of constraints that patterns
should satisfy. We study a novel type of inductive query that consists of 
arbitrary boolean expressions over monotonic and anti-monotonic prim- 
itives. One such query asks for all patterns that have a frequency of 
at least 50 on the positive examples and of at most 3 on the negative 
examples. 

We investigate the properties of the solution spaces of boolean induc- 
tive queries. More specifically, we show that the solution space w.r.t. a 
conjunctive query is a version space, which can be represented by its bor- 
der sets, and that the solution space w.r.t. an arbitrary boolean inductive 
query corresponds to a union of version spaces. We then discuss the role 
of operations on version spaces (and their border sets) in computing the 
solution space w.r.t. a given query. We conclude by formulating some 
thoughts on query optimization. 

Keywords: inductive databases, inductive querying, constraint based 
mining, version spaces, convex spaces. 



1 Introduction 

The concept of inductive databases was introduced by Imielinski and Mannila in 
[23]. Inductive databases are databases that contain both patterns and data. In 
addition, they provide an inductive query language in which the user can not only
query the data that resides in the database but also mine the patterns of interest 
that hold in the data. The long-term goal of inductive database research is to 
put data mining on the same methodological grounds as databases. Despite the 
introduction of a number of interesting query languages for inductive databases, 
cf. [10,12,14,15,20,18,6,27], we are still far away from a general theory of inductive 
databases. 

The problem of inductive querying can be described as follows. Given is
an inductive database $\mathcal{D}$ (a collection of data sets), a language of
patterns $\mathcal{L}$, and an inductive query $q(\tau)$ expressed in an inductive
query language. The result of the inductive query is the set of patterns
$Th(q(\tau), \mathcal{L}, \mathcal{D}) = \{\tau \in \mathcal{L} \mid q(\tau) \text{ is true in database } \mathcal{D}\}$.
It contains all patterns $\tau$ in the language $\mathcal{L}$ that satisfy the
inductive query $q$ in the database $\mathcal{D}$ ¹. Boolean inductive queries are
arbitrary boolean expressions over monotonic or anti-monotonic constraints on
a single pattern variable $\tau$. An example query could ask for all patterns $\tau$
whose frequency is larger than 100 in data set $D_1$ and whose frequency in data
set $D_2$ or $D_3$ is lower than 20, where the data sets $D_i$ belong to the inductive
database $\mathcal{D}$. This type of query is motivated by our earlier MolFea system
for molecular feature mining [18]. However, in MolFea, only conjunctive queries
were handled. To the best of the author’s knowledge, boolean inductive queries
are the most general form of query that has been considered so far in the data
mining literature (but see also [9,12,10]).

R. Meo et al. (Eds.): Database Support for Data Mining Applications, LNAI 2682, pp. 117-134, 2004.
© Springer-Verlag Berlin Heidelberg 2004

We investigate the properties of the solution space $Th(q, \mathcal{L}, \mathcal{D})$.
More specifically, for conjunctive queries $q$ the solution space is a version space,
which can be compactly represented using its border sets $S$ and $G$,
cf. [18,8,22,21,24,25,29]. For arbitrary boolean inductive queries,
$Th(q, \mathcal{L}, \mathcal{D})$ is a union of version spaces.
Furthermore, solution spaces and their border sets can be manipulated using op- 
erations on version spaces, such as those studied by Hirsh [25] and Gunter et al. 
[13]. These operations are used in an algorithm for computing the solution spaces 
w.r.t. an inductive query. The efficiency (and the borders that characterize the 
outcome) of this algorithm depend not only on the semantics of the query but 
also on its syntactic form. This raises the issue of query optimization, which is 
concerned with transforming the original query into an efficient and semantically 
equivalent form. This situation is akin to that in traditional databases and we 
formulate some initial ideas in this direction, cf. also [9]. 

This paper is organized as follows. In Section 2, we briefly review the MolFea 
system, which forms the motivation for much of this work, in Section 3, we 
introduce some basic terminology, in Section 4, we discuss boundary sets, in 
Section 5, we present an algorithm to evaluate queries, in Section 6, we discuss 
query optimization, and finally, in Section 7, we conclude. 



2 A Motivating Example 

MolFea is a domain specific inductive database for mining features of interest 
in sets of molecules. The examples in MolFea are thus molecules, and the pat- 
terns are molecular fragments. More specifically, in [18] we employed the 2D 
structure of molecules, and linear sequences of atoms and bonds as fragments. 
An example molecule named AZT, a commonly used drug against HIV, is illus- 
trated in Figure 1. Two interesting molecular fragments discovered using MolFea 
are: 



‘N=N=N-C-C-C-n:c:c:c=O’
‘N=N=N-C-C-C-n:c:n:c=O’



¹ This notation is adapted from that introduced by Mannila and Toivonen [28].

Fig. 1. Chemical structure of azidothymidine



In these fragments, ‘C’, ‘N’, ‘Cl’, etc. denote elements ², and ‘-’ denotes a
single bond, ‘=’ a double bond, ‘#’ a triple bond, and ‘:’ an aromatic bond.
The two fragments occur in AZT because there exist labelled paths in AZT that
correspond to these fragments.

These two patterns have been discovered using MolFea in a database con- 
taining over 40 000 molecules that have been tested in vitro for activity against 
HIV, cf. [18] for more details. On the basis of these tests, the molecules have been 
divided into three categories: confirmed active CA, moderately active MA and 
inactive I. The above patterns are solutions to a query of the form
$freq(\tau, CA) \ge x \wedge freq(\tau, I) \le y$ where $x, y$ are thresholds. So, we were interested in finding
those fragments that occur frequently in the data sets CA (consisting of about 
400 substances) and that are infrequent in I (consisting of more than 40 000 sub- 
stances). In order to answer such conjunctive queries, MolFea employs the level 
wise version space algorithm, cf. also Section 5.1 and [8]. It should be noted that 
MolFea has also been employed on a wide range of other molecular data sets, cf. 
[17]. In such applications, one is often interested in classification. When this is the 
case, one can proceed by first deriving interesting patterns using the type of query 
specified above and then using the discovered patterns as boolean attributes in 
a predictive data mining system, such as a decision tree or a support vector 
machine. As shown in [17], effective predictors can often be obtained using this 
method. Even though MolFea has proven to be quite effective on various real-life 
applications, it is limited in that it only processes conjunctive queries. In this pa- 
per, we study how this restriction can be lifted and consider arbitrary boolean in- 
ductive queries. The resulting framework can directly be incorporated in MolFea. 



² Elements involved in aromatic bonds are written in lower-case.

3 Formalization

A useful abstraction of MolFea patterns and examples, which we will employ 
throughout this paper for illustration purposes, is the pattern domain of strings. 
This pattern domain should also be useful for other applications, e.g., concerning 
DNA/RNA, proteins or other bioinformatics applications. In the string pattern 
domain, examples as well as patterns are strings expressed in a language
$\mathcal{L}_\Sigma = \Sigma^*$ over an alphabet $\Sigma$. Furthermore, a pattern $p$ matches or covers an example
e if and only if p is a substring of e, i.e., the symbols of p occur at consecutive 
positions in e. An inductive database may consist of different data sets. These 
data sets may correspond to different classes of data. E.g., in MolFea, we have 
analyzed classes of confirmed active, moderately active and inactive molecules 
w.r.t. HIV-activity [18].

Example 1. A toy database that we will be using throughout this paper is

- $e_1 = aabbcc$; $e_2 = abbc$; $e_3 = bb$; $e_4 = abc$; $e_5 = bc$; $e_6 = cc$

- $D_1 = \{e_1, e_2, e_3\} = \{aabbcc, abbc, bb\}$;
  $D_2 = \{e_4, e_5, e_6\} = \{abc, bc, cc\}$;
  $D_3 = D_1 \cup D_2$

- $p_1 = abb$; $p_2 = bb$; $p_3 = cc$

- $\mathcal{D} = \{D_1, D_2, D_3\}$



In addition to using the pattern domain of strings, we will employ the data
miner’s favorite item sets [1]. Then, if $\mathcal{I}$ is the set of items considered,
examples $e$ as well as patterns $p$ are subsets of $\mathcal{I}$ and the language of
all patterns is $\mathcal{L} = 2^{\mathcal{I}}$. Furthermore, the database may then
consist of various data sets, i.e., sets of item sets.

Throughout this paper, we will allow for inductive queries $q(\tau)$ that contain
exactly one pattern variable $\tau$. These queries can be interpreted as sets of
patterns $Th(q(\tau), \mathcal{L}, \mathcal{D}) = \{\tau \in \mathcal{L} \mid q(\tau) \text{ is true in } \mathcal{D}\}$,
and for compactness reasons we will often write $q(\tau)$ to denote this solution set.
Inductive queries are boolean expressions over atomic queries. An atomic query
$q(\tau)$, or atom for short, is a logical atom $p(t_1, \ldots, t_n)$ over a predicate $p$
of arity $n$ where $n - 1$ of the arguments $t_i$ are specified (i.e. ground) and one of
the arguments $t_i$ is the pattern variable $\tau$. Let us now, by way of illustration,
review MolFea’s atomic queries.

- Let $g$ and $s$ be strings. Then $g$ is more general than $s$, notation $g \preceq s$,
  if and only if $g$ is a substring of $s$. E.g., on our earlier example,
  $p_2 \preceq p_1$ evaluates to true, and $p_3 \preceq p_1$ to false. This predicate,
  defined here in the context of strings, applies to virtually any pattern domain.
  It corresponds to the well-known generality relation in concept-learning, cf. [22].

- The $\preceq$ relation can now be used in atomic queries of the type $\tau \preceq p$,
  $\neg(\tau \preceq p)$, $p \preceq \tau$, and $\neg(p \preceq \tau)$, where $\tau$ is a
  pattern variable denoting the target pattern and $p$ a specific pattern. E.g.,
  $\tau \preceq abc$ yields as solutions the set of substrings of $abc$.

- Let $p$ be a pattern and $D$ a data set, i.e. a set of examples. Then
  $freq(p, D) = card\{e \in D \mid p \preceq e\}$, where $card(S)$ denotes the
  cardinality of the set $S$. So, $freq(p, D)$ denotes the number of instances in $D$
  covered by $p$, i.e. the frequency of $p$ in $D$. E.g., $freq(p_3, D_2) = 1$.

- The $freq$ construct can now be used in atoms of the following form:
  $freq(\tau, D) \ge t$ and $freq(\tau, D) \le t$ where $t$ is a numerical threshold,
  $\tau$ is the queried pattern or string, and $D$ is a given data set. E.g.,
  $freq(\tau, D_2) \ge 2$ yields the set of all substrings of $bc$.
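
The atoms above can be tried out on the toy database of Example 1 with a few lines of Python. This is a minimal sketch, not MolFea code: the function names are chosen for this illustration, and the empty string is assumed to count as a substring of every string.

```python
def is_more_general(p, e):
    """p is more general than e in the string domain: p is a substring of e."""
    return p in e

def freq(p, dataset):
    """Number of examples in the data set that are covered by pattern p."""
    return sum(1 for e in dataset if is_more_general(p, e))

def substrings(s):
    """All substrings of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

D1 = ["aabbcc", "abbc", "bb"]
D2 = ["abc", "bc", "cc"]

# Only substrings of some example in D2 can occur in D2 at all, so they
# form a sufficient candidate set for a minimum-frequency atom on D2.
candidates = set().union(*(substrings(e) for e in D2))
frequent = {p for p in candidates if freq(p, D2) >= 2}

print(is_more_general("bb", "abb"), is_more_general("cc", "abb"))
print(freq("cc", D2))
print(frequent == substrings("bc"))
```

The last line compares the solutions of $freq(\tau, D_2) \ge 2$ with the set of substrings of $bc$, as stated in the text.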

Throughout this paper, as in most other works on constraint based mining, 
we will assume that all atoms are either monotonic or anti-monotonic. These 
notions are defined w.r.t. the generality relation $\preceq$ among patterns. Formally
speaking, the generality relation is a quasi-ordering (i.e., a relation satisfying
the reflexivity and transitivity properties).

Definition 1. An atom $p$ is monotonic if and only if
$\forall x, y \in \mathcal{L} : (x \preceq y) \wedge p(x) \rightarrow p(y)$.

Definition 2. An atom $p$ is anti-monotonic if and only if
$\forall x, y \in \mathcal{L} : (x \preceq y) \wedge p(y) \rightarrow p(x)$.

Let us illustrate these definitions on the domain of item sets. 

Example 2. Consider the lattice consisting of the item sets over $\{a, b, c, d\}$
graphically illustrated in Figure 2. The frequencies of the item sets on data sets
$d_1$ and $d_2$ are specified between brackets.

{} (10, 100)

{a} (9, 100)   {b} (10, 80)   {c} (9, 90)   {d} (9, 80)

{a, b} (8, 70)   {a, c} (9, 90)   {a, d} (7, 70)   {b, c} (6, 80)   {b, d} (5, 40)   {c, d} (8, 60)

Fig. 2. An item set lattice with frequencies in data sets $d_1$ and $d_2$

Examples of anti-monotonic atoms include $freq(\tau, d_1) \ge 7$ for the item set
domain and $\tau \preceq acd$ for the string domain. Monotonic ones include
$freq(\tau, d_2) \le 30$ and $\tau \succeq \{a, c, d\}$ for the item set example.
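
Definitions 1 and 2 can be checked mechanically on a finite set of patterns. The sketch below does so in Python for the item sets of size at most two listed for Figure 2; it is an illustration under assumptions: the helper names are invented here, and the threshold 70 on $d_2$ is a hypothetical value chosen so that the atom is satisfied by some of the listed item sets.

```python
# Frequencies (d1, d2) of the item sets of size 0 to 2 listed for Fig. 2.
FREQ = {
    frozenset(): (10, 100),
    frozenset("a"): (9, 100), frozenset("b"): (10, 80),
    frozenset("c"): (9, 90), frozenset("d"): (9, 80),
    frozenset("ab"): (8, 70), frozenset("ac"): (9, 90),
    frozenset("ad"): (7, 70), frozenset("bc"): (6, 80),
    frozenset("bd"): (5, 40), frozenset("cd"): (8, 60),
}
PATTERNS = list(FREQ)

def is_anti_monotonic(atom, patterns):
    """Whenever x is more general than y (x subset of y) and atom(y), then atom(x)."""
    return all(atom(x) for y in patterns if atom(y)
               for x in patterns if x <= y)

def is_monotonic(atom, patterns):
    """Whenever x is more general than y (x subset of y) and atom(x), then atom(y)."""
    return all(atom(y) for x in patterns if atom(x)
               for y in patterns if x <= y)

def min_freq_d1(p):
    return FREQ[p][0] >= 7      # freq(tau, d1) >= 7

def max_freq_d2(p):
    return FREQ[p][1] <= 70     # freq(tau, d2) <= 70 (hypothetical threshold)

print(is_anti_monotonic(min_freq_d1, PATTERNS), is_monotonic(min_freq_d1, PATTERNS))
print(is_monotonic(max_freq_d2, PATTERNS), is_anti_monotonic(max_freq_d2, PATTERNS))
```

On this finite fragment the minimum-frequency atom passes only the anti-monotonicity check and the maximum-frequency atom only the monotonicity check.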

By now we can define boolean inductive queries. 

Definition 3. Any monotonic or anti-monotonic atom of the form $p(\tau)$ is a
query. Furthermore, if $q_1(\tau)$ and $q_2(\tau)$ are queries over the same pattern
variable $\tau$, then $\neg q_1(\tau)$, $q_1(\tau) \vee q_2(\tau)$ and
$q_1(\tau) \wedge q_2(\tau)$ are also queries.

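The definition of the solution sets $Th(q, \mathcal{L}, \mathcal{D})$ can now be
extended to boolean inductive queries in the natural way, i.e.

- $Th(q_1(\tau) \vee q_2(\tau), \mathcal{L}, \mathcal{D}) = Th(q_1(\tau), \mathcal{L}, \mathcal{D}) \cup Th(q_2(\tau), \mathcal{L}, \mathcal{D}) = q_1(\tau) \cup q_2(\tau)$;

- $Th(q_1(\tau) \wedge q_2(\tau), \mathcal{L}, \mathcal{D}) = Th(q_1(\tau), \mathcal{L}, \mathcal{D}) \cap Th(q_2(\tau), \mathcal{L}, \mathcal{D}) = q_1(\tau) \cap q_2(\tau)$;

- $Th(\neg q(\tau), \mathcal{L}, \mathcal{D}) = \mathcal{L} - Th(q(\tau), \mathcal{L}, \mathcal{D}) = \mathcal{L} - q(\tau)$.

This definition implies that
$((freq(\tau, d_1) \ge 5) \vee (freq(\tau, d_2) \le 90)) \wedge (\tau \preceq \{a, b\})$
is a query and that its solution set can be computed as
$((freq(\tau, d_1) \ge 5) \cup (freq(\tau, d_2) \le 90)) \cap (\tau \preceq \{a, b\})$.
Moreover, the solution set for this query w.r.t. Figure 2 is $\{\{b\}, \{a, b\}\}$.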
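
The correspondence between boolean connectives and set operations can be sketched in Python on the string database of Example 1. This is only an illustration under an explicit assumption: since $\Sigma^*$ is infinite, the pattern language is replaced by the finite set of all substrings of the examples, so that the complement used for negation is computable. The names `L`, `th` and the chosen queries are hypothetical.

```python
def substrings(s):
    """All substrings of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def freq(p, dataset):
    """Number of examples that contain p as a substring."""
    return sum(1 for e in dataset if p in e)

D1 = ["aabbcc", "abbc", "bb"]
D2 = ["abc", "bc", "cc"]

# Finite stand-in for the pattern language (assumption of this sketch).
L = set().union(*(substrings(e) for e in D1 + D2))

def th(atom):
    """Solution set of an atom, restricted to the finite language L."""
    return {p for p in L if atom(p)}

frequent_d1 = th(lambda p: freq(p, D1) >= 2)   # freq(tau, D1) >= 2
frequent_d2 = th(lambda p: freq(p, D2) >= 2)   # freq(tau, D2) >= 2
below_cc = th(lambda p: p in "cc")             # tau more general than cc

# Conjunction is intersection, negation is complement w.r.t. L,
# and disjunction is union.
conjunctive = frequent_d1 & (L - frequent_d2)
boolean_query = conjunctive | below_cc

print(sorted(conjunctive))
print(sorted(boolean_query))
```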

Using boolean inductive queries over the predicates available in MolFea, one 
can formulate the following type of queries: 

- Traditional minimal frequency queries can be performed using
  $(freq(\tau, D_1) \ge 2)$.

- Complex queries such as $(freq(\tau, D_{pos}) \ge n) \wedge (freq(\tau, D_{neg}) \le m)$
  ask for the set of patterns that are frequent on the positive examples $D_{pos}$ and
  infrequent on the negatives in $D_{neg}$.

- Further syntactic constraints could be added. E.g., if we are interested only
  in patterns that are a substring of $abababab$ and a superstring of $ab$ we could
  refine the above query to
  $(freq(\tau, D_{pos}) \ge n) \wedge (freq(\tau, D_{neg}) \le m) \wedge (\tau \preceq abababab) \wedge (ab \preceq \tau)$.

In MolFea, inductive queries were required to be conjunctive. The techniques 
studied in this paper allow us to lift that restriction and to extend MolFea in 
order to answer any boolean inductive queries. 

Since one can consider inductive queries as atoms over a single pattern variable,
we will sometimes be talking about anti-monotonic and monotonic queries.
The set of all monotonic queries will be denoted as $\mathcal{M}$ and the set of all
anti-monotonic ones by $\mathcal{A}$. Some well-known properties of $\mathcal{A}$ and
$\mathcal{M}$ are summarized in the following property.

Property 1. If $a, a_1, a_2 \in \mathcal{A}$ and $m, m_1, m_2 \in \mathcal{M}$, then
$\neg a \in \mathcal{M}$; $\neg m \in \mathcal{A}$; $(a_1 \vee a_2) \in \mathcal{A}$;
$(a_1 \wedge a_2) \in \mathcal{A}$; $(m_1 \vee m_2) \in \mathcal{M}$ and
$(m_1 \wedge m_2) \in \mathcal{M}$.
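
Two of these closure properties follow in one step from Definitions 1 and 2. The derivation below is a short sketch added for illustration; it is not taken from the cited literature.

```latex
% Claim 1: if a is anti-monotonic, then \neg a is monotonic.
% Claim 2: if a_1, a_2 are anti-monotonic, then so is a_1 \wedge a_2.
\begin{align*}
&\text{Assume } x \preceq y \text{ and } \neg a(x). \\
&\text{If } a(y) \text{ held, anti-monotonicity of } a \text{ would give } a(x),
  \text{ a contradiction.} \\
&\text{Hence } (x \preceq y) \wedge \neg a(x) \rightarrow \neg a(y),
  \text{ so } \neg a \in \mathcal{M}. \\[4pt]
&\text{Assume } x \preceq y \text{ and } a_1(y) \wedge a_2(y). \\
&\text{Then } a_1(x) \text{ and } a_2(x) \text{ by anti-monotonicity of each conjunct,} \\
&\text{so } (x \preceq y) \wedge (a_1 \wedge a_2)(y) \rightarrow (a_1 \wedge a_2)(x),
  \text{ i.e. } (a_1 \wedge a_2) \in \mathcal{A}.
\end{align*}
```

The remaining cases are obtained by the same argument with the roles of $x$ and $y$ exchanged.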

4 Boundary Sets

Within the fields of data mining and machine learning it is well-known that 
the space of solutions $Th(q, \mathcal{L}, \mathcal{D})$ to specific types of inductive queries can be
represented using boundary sets. E.g., Hirsh [24,25] has shown that sets that are 
convex and definite can be represented using their boundaries, Mellish [21] and 
Gunter et al. [13] also list conditions for the representability of sets using their 
borders in the context of version spaces, Mannila and Toivonen [28] have shown 
that the solution space for anti-monotonic queries can be represented using the 
set of minimally general elements. Furthermore, various algorithms exploit these 
properties for efficiently finding solutions to queries, cf. Bayardo’s MaxMiner [3] 
and Mitchell’s candidate-elimination algorithm [22] in the context of
concept-learning. Most of the early work on version spaces was concerned with concept
learning (i.e. queries of the form

$freq(\tau, Pos) \ge |Pos| \wedge freq(\tau, Neg) \le 0$

where $Pos$ is a set of positive and $Neg$ of negative examples). In data mining,
one typically investigates queries of the form $freq(\tau, D) \ge x$. One of the key
contributions of the MolFea framework was the realization that inductive con- 
junctive queries can - when their solution set is finite - be represented using 
boundary sets. Furthermore, MolFea employed a novel algorithm, the so-called 
level-wise version space algorithm for computing these boundary sets (cf. also 
Section 5.1). Here, we will study how to extend this framework for dealing with 
arbitrary boolean queries. 

The boundary sets, sometimes also called the borders, are the maximally
specific (resp. general) patterns within the set. More formally, let $P$ be a set of
patterns. Then we denote the set of minimal (i.e. minimally specific) patterns
within $P$ as $min(P)$. Dually, $max(P)$ denotes the maximally specific patterns
within $P$.

In the remainder of this paper, we will be largely following the terminology 
and notation introduced by Mitchell [22], Hirsh [24] and Mellish [21]. 

Definition 4. The S-set $S(P)$ w.r.t. a set of patterns $P \subseteq \mathcal{L}$ is
defined as $S(P) = max(P)$; and the G-set w.r.t. a set of patterns
$P \subseteq \mathcal{L}$ is defined as $G(P) = min(P)$.

We will use this definition to characterize the solution sets of certain types of 
queries. 

Definition 5. A set $T$ is upper boundary set representable if and only if
$T = \{t \in \mathcal{L} \mid \exists g \in min(T) : g \preceq t\}$; it is lower
boundary set representable if and only if
$T = \{t \in \mathcal{L} \mid \exists s \in max(T) : t \preceq s\}$; and it is boundary
set representable if and only if
$T = \{t \in \mathcal{L} \mid \exists g \in min(T), s \in max(T) : g \preceq t \preceq s\}$.

Sets that are boundary set representable are sometimes also referred to as 
version spaces. Remark that the above definition allows the border sets to be 
infinitely large. 
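
The following Python sketch illustrates Definitions 4 and 5 in the string domain. The pattern set `P` is a hypothetical example, and the function names are chosen for this illustration; it computes the G-set and S-set and then rebuilds the set from its borders.

```python
def substrings(s):
    """All substrings of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def g_set(P):
    """min(P): patterns of P with no strictly more general pattern in P."""
    return {p for p in P if not any(q != p and q in p for q in P)}

def s_set(P):
    """max(P): patterns of P with no strictly more specific pattern in P."""
    return {p for p in P if not any(q != p and p in q for q in P)}

def from_borders(G, S):
    """All strings t lying between some g in G and some s in S."""
    return {t for s in S for t in substrings(s) if any(g in t for g in G)}

P = {"a", "ab", "abb", "abbc", "bb", "bbc"}
G, S = g_set(P), s_set(P)
print(sorted(G), sorted(S))
print(from_borders(G, S) == P)        # P is boundary set representable

Q = {"a", "abb"}                      # "ab" lies in between but is missing
print(from_borders(g_set(Q), s_set(Q)) == Q)
```

For `P` the borders are enough to recover the whole set, whereas for `Q` the reconstruction also yields `ab`, so `Q` is not boundary set representable.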

Example 3. Consider the constraint $freq(\tau, d_1) \ge 4$ over the item sets in Figure
2. Then $S(freq(\tau, d_1) \ge 4) = \{\{a, d\}, \{b, d\}, \{c, d\}, \{a, b, c\}\}$;
$G(\{a, b\} \preceq \tau) = \{\{a, b\}\}$, and finally,
$S((freq(\tau, d_1) \ge 4) \wedge (\{a\} \preceq \tau)) = \{\{a, b, c\}, \{a, d\}\}$ and
$S((freq(\tau, d_1) \ge 4) \wedge (\{a, b\} \preceq \tau)) = \{\{a, b, c\}\}$.

Hirsh [25] has characterized version spaces using notions of convexity and 
definiteness as specified in Theorem 1. 

Definition 6. A set P is convex if and only if for all $p_1, p_2 \in P$, $p \in \mathcal{L}$:
$p_1 \preceq p \preceq p_2$ implies that $p \in P$.

Definition 7. A set P is definite if and only if for all $p \in P$,
$\exists s \in max(P), g \in min(P) : g \preceq p \preceq s$.



Theorem 1 (Hirsh). A set is boundary set representable if and only if it is 
convex and definite. 

Let us now investigate the implications of Hirsh’s framework in the context 
of inductive queries. 

Lemma 1. Let $a \wedge m$ be a query such that $a \in \mathcal{A}$ and $m \in \mathcal{M}$. Then
$Th(a \wedge m, \mathcal{L}, \mathcal{D})$ is convex.

This property directly follows from the definitions of monotonicity and
anti-monotonicity. If we want to employ border representation for representing
solution sets we also need that $Th(a \wedge m, \mathcal{L}, \mathcal{D})$ is definite. According to
Hirsh [25], any finite set is also definite.

Lemma 2. If P is a convex and finite set, then it is boundary set representable. 
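
Convexity (Definition 6) is easy to test by brute force on a small finite language. The sketch below assumes item sets ordered by the subset relation; the query used as an example (an anti-monotonic "subset of {a,b,c}" constraint combined with a monotonic "contains a" constraint) is hypothetical and only illustrates Lemma 1.

```python
from itertools import combinations

def powerset(items):
    """All subsets of `items` as frozensets."""
    items = sorted(items)
    return {frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)}

def is_convex(patterns, language):
    """Definition 6: whenever p1 <= p <= p2 with p1, p2 in P, p must be in P."""
    return all(p in patterns
               for p1 in patterns
               for p2 in patterns
               for p in language
               if p1 <= p <= p2)

language = powerset("abcd")
# Hypothetical query a AND m:
#   a (anti-monotonic): t is a subset of {a, b, c}
#   m (monotonic):      t contains item 'a'
theory = {t for t in language if t <= frozenset("abc") and "a" in t}
# A non-convex set for contrast: {a} and {b} lie between its two members.
not_convex = {frozenset(), frozenset("ab")}
```

Since `theory` is finite and convex, Lemma 2 says it is boundary set representable; here its borders are G = {{a}} and S = {{a,b,c}}.
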

This implies that the solution space for queries of the form $a \wedge m$ over the
pattern language of item sets is boundary set representable. What about infinite
pattern languages such as $\Sigma^*$, the set of strings over the alphabet $\Sigma$? Consider
the pattern set $\Sigma^*$. This set is not definite because $max(\Sigma^*)$ does not exist.
Indeed, for every string in $\Sigma^*$ there is a more specific one. One way to circumvent
this problem is to introduce a special bottom element $\bot$ that is by definition more
specific than any other string in $\Sigma^*$, cf. also [13,21], and to add it to the language
of patterns $\mathcal{L}$. However, this is not a complete solution to the problem. Indeed,
consider the language $\Sigma^*$ over the alphabet $\Sigma = \{a, b\}$ and the set $A_s = a^*$
(where we employ the regular expression notation). This set is again not definite
because $\bot \notin A_s$ and infinite strings (such as $a^\infty \notin \Sigma^*$) are not strings. Thus
$max(A_s)$ does not exist. Furthermore, the set $A_s$ is the solution to the inductive
query $\neg(b \preceq \tau)$. Other problems arise because the boundary set may be infinitely
large. Consider, e.g., the constraint $(a \preceq \tau) \wedge (b \preceq \tau)$ over the language $\Sigma^*$, where
$\Sigma = \{a,b,c\}$. The G-set in this case would be $ac^*b \cup bc^*a$ because any string
containing an a and a b must be more specific than a string of the form $ac^nb$
or $bc^na$. Thus, it seems that not all queries result in boundary set representable
version spaces. This motivates our definition of safe queries.
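
The growth of the G-set in the last example can be observed by enumeration. The sketch below assumes that generality on strings is the substring relation and lists the minimal strings, up to a length bound, that contain both an a and a b; the function name and the bound are illustrative only.

```python
from itertools import product

def minimal_solutions(alphabet, max_len):
    """Minimal strings (w.r.t. the substring order) of length <= max_len
    that contain both 'a' and 'b'."""
    solutions = ["".join(w)
                 for n in range(1, max_len + 1)
                 for w in product(alphabet, repeat=n)
                 if "a" in w and "b" in w]
    # A solution is minimal if no other solution is a substring of it.
    # Every proper substring is shorter, so it was enumerated as well.
    minimal = [s for s in solutions
               if not any(g != s and g in s for g in solutions)]
    return sorted(minimal, key=lambda s: (len(s), s))

borders_up_to_4 = minimal_solutions("abc", 4)
borders_up_to_5 = minimal_solutions("abc", 5)
```

Each additional unit of length contributes two new minimal elements (one of the form $ac^nb$ and one of the form $bc^na$), so the enumeration never stabilizes and the G-set is infinite.
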

Definition 8. A conjunctive query q is safe if and only if its solution set is
boundary set representable and its borders are finite.

This situation is akin to that in traditional databases, where some queries 
are also unsafe, e.g. queries for the set of tuples that do not belong to a given 
relation. 

One immediate question that arises is which queries are safe for a particular
type of pattern language $\mathcal{L}$ and constraint. Let us first remark that all queries
in the case of finite languages (such as e.g. item sets) are safe. On the other
hand, for infinite pattern languages (such as strings) and the MolFea primitives,
we have that

- $\tau \preceq s$ where $s \in \Sigma^*$ is safe because $S = \{s\}$.

- $\neg(\tau \preceq s)$ with $s \in \Sigma^*$ is safe because
  $G = min\{g \in \Sigma \cup \Sigma^2 \cup \ldots \cup \Sigma^k \cup \Sigma^{k+1} \mid k = length(s) \text{ and } \neg(g \preceq s)\}$.
  This is a finite set as all of its constituents are.

- $s \preceq \tau$ is safe because $G = \{s\}$.

- $\neg(s \preceq \tau)$ is not safe as shown above.

- $freq(\tau, D) \geq t$ is safe, because if $t \geq 1$ then patterns $\tau$ satisfying
  $freq(\tau, D) \geq t$ must be substrings of strings in D and there are only a finite
  number of them; on the other hand if $t = 0$ then $S = \{\bot\}$, so S always exists
  and is finite.

- $freq(\tau, D) \leq t$ is safe, because there are only a finite number of substrings of
  D that are excluded from consideration. So,
  $G = min\{g \in \Sigma \cup \Sigma^2 \cup \ldots \cup \Sigma^k \cup \Sigma^{k+1} \mid k = length(s)$ where s is the longest
  string in D and $freq(g, D) \leq t\}$. This is a finite set.

- Furthermore, if query q is of the form $a \wedge m$ where $a \in \mathcal{A}$ and $m \in \mathcal{M}$ for
  which a is safe, then q will also be safe. The reason for this is that a finite S
  restricts the solution set to being finite (as there exist only a finite number
  of substrings of strings in S).
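
The argument for the minimum frequency constraint can be sketched in code: for $t \geq 1$ every solution is a substring of some string in D, so the S-set can be found by scanning that finite candidate set. The sketch assumes that frequency counts the number of strings in D containing the pattern and that patterns are non-empty strings; the function names and the data are hypothetical.

```python
def substrings(s):
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def freq(pattern, data):
    """Number of strings in `data` that contain `pattern` as a substring."""
    return sum(pattern in s for s in data)

def s_set_min_freq(data, t):
    """S-set of freq(tau, D) >= t for t >= 1: the maximally specific
    (longest, w.r.t. the substring order) frequent substrings."""
    candidates = set().union(*(substrings(s) for s in data))
    frequent = {p for p in candidates if freq(p, data) >= t}
    return {p for p in frequent
            if not any(p != q and p in q for q in frequent)}

data = ["abcd", "bcde", "xbcy"]   # hypothetical data set D
```

Because the candidate set is bounded by the substrings of D, the loop always terminates, which mirrors the finiteness argument in the list above.
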



5 Query Evaluation 

Let us now investigate how queries can be evaluated and how the boundary sets 
can be employed in this context. 

5.1 Conjunctive Queries 

The first observation is that, as argued above, the solution sets to safe
conjunctive queries are completely characterized by their S and G-sets. Furthermore,
there exist several algorithms that can be used to compute such border sets.
First, it is well-known that the S-set or positive border for safe anti-monotonic
queries can be computed using adaptations of the level-wise algorithm [19] such
as the MaxMiner system by Bayardo [3]. Dual versions of these algorithms could
be used to compute the G-set w.r.t. monotonic safe queries, cf. [8,18]. Secondly,
several algorithms exist that compute both the S and G sets w.r.t. a safe
conjunctive query, such as the level-wise version space algorithm underlying MolFea
[8]. This algorithm combines the level-wise algorithm with principles of the
candidate elimination algorithm studied by Mellish and Mitchell [21,22]. Furthermore,
within MolFea, the level-wise version space algorithm has been used to deal with 
reasonably large data sets, such as the HIV data set discussed in Section 2. In 
addition, several algorithms exist that would compute the whole solution set of 
a conjunctive query over a finite pattern domain, cf. [4,5]. It should be possi- 
ble to adapt these last algorithms in order to generate the borders only. In the 
remainder of this paper, we will not give any further details on how to com- 
pute a specific S and/or G-set as this problem has been studied thoroughly in 
the literature. Furthermore, we will assume that queries are safe unless stated 
otherwise. 



5.2 Boolean Queries 

Let us now address the real topic of this paper, which is how to compute and 
characterize the solution space w.r.t. general boolean queries. Before investi- 
gating these solution spaces more closely, let us mention that when q and q' 
are logically equivalent then their solution sets are identical. This is formalized 
below. 

Property 2. Let q and q' be two boolean queries that are logically equivalent.
Then $Th(q, \mathcal{L}, \mathcal{D}) = Th(q', \mathcal{L}, \mathcal{D})$ for any inductive database $\mathcal{D}$.

This property is quite useful in showing that the solution space of a safe 
general boolean query can be specified as the union of various convex sets. 

Let q be a boolean inductive query. Then $Th(q, \mathcal{L}, \mathcal{D})$ will consist of the union
of a finite number of convex sets $\bigcup_i Th(q_i, \mathcal{L}, \mathcal{D})$. To see why this is the case, one
can rewrite the query q into its disjunctive normal form $q = q_1 \vee \ldots \vee q_n$. In this
form each of the $q_i$ is a conjunctive query. Furthermore, if these sub-queries $q_i$ are
safe, their solution sets $Th(q_i, \mathcal{L}, \mathcal{D})$ form version spaces that can be represented
by their border sets. This motivates the following definition and result.

Definition 9. A boolean inductive query q is safe if and only if there exists an
equivalent formula $q' = q_1 \vee \ldots \vee q_k$ that is in DNF and for which all $q_i$ are safe.

Notice that the fact that a query is safe does not imply that all possible
rewrites in DNF will yield safe $q_i$. Indeed, let a (resp. b) denote the atom $(a \preceq \tau)$
(resp. $(b \preceq \tau)$). Then the query a is equivalent to $(a \wedge b) \vee (a \wedge \neg b)$ of which the
first conjunction is not safe.

Lemma 3. Let q be a safe boolean query. Then $Th(q, \mathcal{L}, \mathcal{D})$ is a union of version
spaces, each of which can be represented by its boundary sets.

The question of solving boolean queries is now reduced to that of computing 
the version spaces or their border sets. 

5.3 Operations on Border Sets

We will now employ a simple algebra that will allow us to compute the boundary 
sets w.r.t. given boolean queries. The algebraic operations (such as intersection 
and union) are not new. They have been studied before by Haym Hirsh [25] and 
Gunter et al. [13]. However, Hirsh and Gunter studied version spaces in quite 
different contexts such as concept-learning and in the case of Gunter et al. also 
truth maintenance. Our context, i.e. inductive querying using version spaces, is 
new and so is the incorporation of the results about safety and the use of the 
pattern domain of strings. 

We first show how the boundary sets of a safe query of the form $q_1 \circ q_2$ (where
$\circ$ is conjunction or disjunction) can be computed in terms of the boundary sets
of the queries $q_i$.

Lemma 4. (Disjunction)

- If $a_1, a_2 \in \mathcal{A}$ and $a_1$ and $a_2$ are safe then $a_1 \vee a_2$ is safe and
  $Th(a_1 \vee a_2, \mathcal{L}, \mathcal{D})$ is lower boundary set representable using
  $S(a_1 \vee a_2) = max(S(a_1) \cup S(a_2))$.

- If $m_1, m_2 \in \mathcal{M}$ and $m_1$ and $m_2$ are safe, then $m_1 \vee m_2$ is safe and
  $Th(m_1 \vee m_2, \mathcal{L}, \mathcal{D})$ is upper boundary set representable using
  $G(m_1 \vee m_2) = min(G(m_1) \cup G(m_2))$.

Let us first define the necessary operations used when intersecting two version
spaces. Here mgg and mss denote the minimally general generalizations and
maximally specific specializations of the two elements*.

Definition 10. (mgg and mss)

- $mgg(s_1, s_2) = max\{s \in \mathcal{L} \mid s \preceq s_1 \wedge s \preceq s_2\}$.

- $mss(g_1, g_2) = min\{g \in \mathcal{L} \mid g_1 \preceq g \wedge g_2 \preceq g\}$.

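Lemma 5. (Conjunction)

- If $a_1, a_2 \in \mathcal{A}$ and $a_1$, $a_2$ and $a_1 \wedge a_2$ are safe, then $Th(a_1 \wedge a_2, \mathcal{L}, \mathcal{D})$ is
  lower boundary set representable using
  $S(a_1 \wedge a_2) = max\{m \mid m \in mgg(s_1, s_2) \text{ where } s_i \in S(a_i)\}$.

- If $m_1, m_2 \in \mathcal{M}$ and $m_1$, $m_2$ and $m_1 \wedge m_2$ are safe then $Th(m_1 \wedge m_2, \mathcal{L}, \mathcal{D})$
  is upper boundary set representable using
  $G(m_1 \wedge m_2) = min\{m \mid m \in mss(g_1, g_2) \text{ where } g_i \in G(m_i)\}$.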

Lemmas 4 and 5 motivate the introduction of the following operations:

- $G(m_1) \vee G(m_2) = min(G(m_1) \cup G(m_2))$

- $S(a_1) \vee S(a_2) = max(S(a_1) \cup S(a_2))$

- $S(a_1) \wedge S(a_2) = max\{m \mid m \in mgg(s_1, s_2) \text{ where } s_i \in S(a_i)\}$

- $G(m_1) \wedge G(m_2) = min\{m \mid m \in mss(g_1, g_2) \text{ where } g_i \in G(m_i)\}$.
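
For item sets ordered by the subset relation, the four operations reduce to set algebra: the only minimally general generalization of two item sets is their intersection, and the only maximally specific specialization is their union. The sketch below relies on that assumption; the function names are illustrative and the border sets in the example are hypothetical.

```python
def minimal(patterns):
    """Keep only the most general (subset-minimal) item sets."""
    return {p for p in patterns if not any(q < p for q in patterns)}

def maximal(patterns):
    """Keep only the most specific (subset-maximal) item sets."""
    return {p for p in patterns if not any(p < q for q in patterns)}

def mgg(s1, s2):
    """Minimally general generalizations of two item sets: their intersection."""
    return {s1 & s2}

def mss(g1, g2):
    """Maximally specific specializations of two item sets: their union."""
    return {g1 | g2}

def s_or(S1, S2):
    return maximal(S1 | S2)

def g_or(G1, G2):
    return minimal(G1 | G2)

def s_and(S1, S2):
    return maximal({m for s1 in S1 for s2 in S2 for m in mgg(s1, s2)})

def g_and(G1, G2):
    return minimal({m for g1 in G1 for g2 in G2 for m in mss(g1, g2)})

fs = frozenset
S1 = {fs("abc"), fs("bd")}   # hypothetical S-set of a1
S2 = {fs("ab"), fs("cd")}    # hypothetical S-set of a2
```

With these inputs, `s_and(S1, S2)` keeps {a,b}, {c} and {d} and drops {b}, which is subsumed by {a,b}; for other pattern languages such as strings, `mgg` and `mss` would have to return sets with several elements.
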



* Gunter et al. [13] call these the quasi-meet and quasi-join; Mellish [21] uses aa and bb
to denote these operations.

Notice that the conjunctive operation on border sets is, in general, not
safe. An illustration is the query $(a \preceq \tau) \wedge (b \preceq \tau)$ which we discussed earlier.
The problem with this query is that even though both atoms are safe, their
conjunction is not, due to an infinitely large G-set.

Lemma 6. If $q_1$, $q_2$ and $q_1 \wedge q_2$ are safe, then $Th(q_1 \wedge q_2, \mathcal{L}, \mathcal{D})$ is boundary set
representable using $S(q_1 \wedge q_2) = S(q_1) \wedge S(q_2)$ and $G(q_1 \wedge q_2) = G(q_1) \wedge G(q_2)$.

This operation corresponds to Hirsh’s version space intersection operation, 
which he employs in the context of concept-learning and explanation-based learn- 
ing. 

The corresponding property for disjunctive queries $q_1 \vee q_2$ does not hold in
general, i.e., the solution set of $q_1 \vee q_2$ is not necessarily a convex space.

Example 4. Consider that we have the version spaces $VS_1$ and $VS_2$ containing
item sets, and having the following boundary sets: $S_1 = \{\{a,b,c,d\}\}$; $G_1 = \{\{a\}\}$
and $S_2 = \{\{c,d,e,f\}\}$; $G_2 = \{\{d,e\}\}$. In this example, the set $VS_1 \cup VS_2$
is not boundary set representable. The reason is that the version space
corresponding to these boundary sets would also include elements like {a, f}
which belong neither to $VS_1$ nor to $VS_2$.

In general, if we have a query that is a disjunction $q_1 \vee q_2$, we will have to keep
track of two version spaces, that is, we need to compute $S(q_1)$, $G(q_1)$ and $S(q_2)$
and $G(q_2)$. The semantics of such a union or a disjunction of version spaces is
then that all elements that are in at least one of the version spaces are included.

Now, everything is in place to formulate a procedure that will generate the
boundary sets of $Th(q, \mathcal{L}, \mathcal{D})$ corresponding to a query q. In this procedure, we
assume that all queries and subqueries are safe. Furthermore, $(S, G)$ denotes the
version space characterized by the border sets S and G, i.e. the set
$\{\tau \in \mathcal{L} \mid \exists s \in S, g \in G : g \preceq \tau \preceq s\}$. Thus the notation $(S_1, G_1) \cup (S_2, G_2)$ denotes
the union of the two version spaces characterized by $(S_1, G_1)$ and $(S_2, G_2)$. This
notation can trivially be extended towards intersection. Furthermore, it is useful
in combination with the operations on version spaces. Indeed, consider that

$(S_1, G_1) \cap (S_2, G_2) = (S_1 \wedge S_2, G_1 \wedge G_2)$

which denotes the fact that the intersection of the two version spaces is a version
space obtained by performing the $\wedge$ operation on its corresponding border sets.

The symbols $\top$ and $\bot$ denote the sets of minimally, respectively maximally
specific elements. These are assumed to exist. If they do not exist, as in the case
of $\bot$ for strings, one has to introduce an artificial one.

By now we are able to formulate our algorithm, cf. Figure 3. It essentially 
computes the borders corresponding to the solution sets of the queries using the 
logical operations introduced earlier. 

Various rewrites can be applied to simplify the final result. As an example
consider $(S, G_1) \cup (S, G_2) = (S, G_1 \cup G_2)$ and $(S_1, G) \cup (S_2, G) = (S_1 \cup S_2, G)$.

Example 5. To illustrate the above algorithm, consider the query
$(freq(\tau, d_1) \geq 5) \vee (freq(\tau, d_2) \leq 85)$. The algorithm would compute
$(\{\{a,b,c\},\{b,d\},\{c,d\}\}, \{\{\}\})$ for the first atom and
$(\{\{a,b,c,d\}\}, \{\{b\},\{d\}\})$ for the second one and
then take the union (cf. the last case in the algorithm).

function vs(q) returns a set of boundary sets $(S_i, G_i)$
        such that $Th(q, \mathcal{L}, \mathcal{D}) = \bigcup_i (S_i, G_i)$
    case $q \in \mathcal{A}$ do
        return $(S(q), \{\top\})$
    case $q \in \mathcal{M}$ do
        return $(\{\bot\}, G(q))$
    case q of the form $a \wedge m$ with $a \in \mathcal{A}, m \in \mathcal{M}$ do
        return $(S(a \wedge m), G(a \wedge m))$
    case q of the form $q_1 \wedge q_2$ with $q_1, q_2 \notin \mathcal{A} \cup \mathcal{M}$ do
        call vs($q_1$) and vs($q_2$)
        assume vs($q_1$) returns $(S_1, G_1) \cup \ldots \cup (S_n, G_n)$
        assume vs($q_2$) returns $(S'_1, G'_1) \cup \ldots \cup (S'_m, G'_m)$
        return $\bigcup_{i,j} (S_i \wedge S'_j, G_i \wedge G'_j)$
    case q of the form $q_1 \vee q_2$ with $q_1, q_2 \notin \mathcal{A} \cup \mathcal{M}$ do
        call vs($q_1$) and vs($q_2$)
        assume vs($q_1$) returns $(S_1, G_1) \cup \ldots \cup (S_n, G_n)$
        assume vs($q_2$) returns $(S'_1, G'_1) \cup \ldots \cup (S'_m, G'_m)$
        return $(S_1, G_1) \cup \ldots \cup (S_n, G_n) \cup (S'_1, G'_1) \cup \ldots \cup (S'_m, G'_m)$

Fig. 3. Computing the solution set w.r.t. an inductive query
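
The procedure of Figure 3 can be sketched for item sets as follows. This is a simplified variant under several assumptions: atoms arrive with precomputed borders (in practice these would come from a level-wise algorithm), queries are nested tuples, the item universe `ITEMS` is hypothetical, and compound anti-monotonic or monotonic subqueries are not merged into a single version space but handled by the generic conjunction and disjunction cases.

```python
from itertools import combinations

ITEMS = frozenset("abcd")            # hypothetical item universe
TOP, BOTTOM = frozenset(), ITEMS     # most general / most specific item set

def minimal(ps):
    return {p for p in ps if not any(q < p for q in ps)}

def maximal(ps):
    return {p for p in ps if not any(p < q for q in ps)}

def vs(q):
    """Return a list of (S, G) border pairs whose union is the solution set.
    Queries: ("anti", S), ("mono", G), ("and", q1, q2), ("or", q1, q2)."""
    kind = q[0]
    if kind == "anti":                       # anti-monotonic atom
        return [(set(q[1]), {TOP})]
    if kind == "mono":                       # monotonic atom
        return [({BOTTOM}, set(q[1]))]
    left, right = vs(q[1]), vs(q[2])
    if kind == "or":                         # keep all version spaces
        return left + right
    if kind == "and":                        # intersect pairwise
        return [(maximal({s1 & s2 for s1 in S1 for s2 in S2}),
                 minimal({g1 | g2 for g1 in G1 for g2 in G2}))
                for S1, G1 in left for S2, G2 in right]
    raise ValueError(f"unknown query type: {kind}")

def members(spaces):
    """Expand a list of border pairs into the item sets they represent."""
    language = {frozenset(c) for r in range(len(ITEMS) + 1)
                for c in combinations(sorted(ITEMS), r)}
    return {t for t in language for S, G in spaces
            if any(g <= t for g in G) and any(t <= s for s in S)}

fs = frozenset
a = ("anti", {fs("abc"), fs("bd"), fs("cd")})   # borders as in Example 5
m = ("mono", {fs("b"), fs("d")})
```

Calling `vs(("or", a, m))` returns the two border pairs listed in Example 5. The pairs returned for a conjunction describe the right set of patterns, although their borders are not necessarily reduced to the elements that actually occur in the solution set.
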



6 Query Optimization 

The results and efficiency of the previous algorithm strongly depend on the
syntactic form of the inductive query. Indeed, consider two queries $q_1$ and $q_2$ that
are logically equivalent but syntactically different. Even though the solution sets
are the same in both cases, the output representations of the boundaries can
greatly differ. As an example consider the boolean expression
$q_1 = (a_1 \vee a_2) \wedge (m_1 \vee m_2)$ and its logically equivalent formulation
$q_2 = (a_1 \wedge m_1) \vee (a_1 \wedge m_2) \vee (a_2 \wedge m_1) \vee (a_2 \wedge m_2)$.
The result of $q_1$ will be represented as a single version
space, because the query is of the form $a \wedge m$ (third case). On the other hand,
the result of $q_2$ would be represented as a union of four version spaces (last case).
Secondly, if one applies the algorithm in a naive way, various boundary sets may
be recomputed using $q_2$, e.g. the $S(a_i)$ and $G(m_i)$.

It is therefore advantageous to consider optimizing the queries. Optimization
then consists of reformulating the query into a form that is 1) logically equivalent
and 2) more efficient to compute. When formulating
query optimization in this form, there is a clear analogy with that of circuit 
design and optimization. Indeed, when designing logical circuits one also starts 
from a logical formula (or possibly a truth-table) and logically rewrites it in the 
desired optimized form. In this optimization process certain connectives are to be 
preferred and other ones are to be avoided. This is akin to the query optimization 
problem that we face here, where the different primitives and operations may 
have different costs. It is thus likely that the solutions that exist for circuit
design can be adapted towards those for query optimization. The connection 
with circuit design remains however the topic of further work. 

Below we discuss one form of optimization, due to De Raedt et al. [9], where
the optimization criterion tries to minimize the number of calls to the level-wise
version space algorithm. In addition, some simple but powerful means of
reasoning about queries are also discussed in Sections 6.2 and 6.3.



6.1 Minimizing the Number of Version Spaces 

In the context of query optimization, one desirable form of query is $a \wedge m$ where
$a \in \mathcal{A}$ and $m \in \mathcal{M}$. Let us call this type of query the version space normal
form. If one can rewrite a query in the version space normal form, the result
will be a single version space (provided that the query is safe). We can then also
compute the boundary sets using known algorithms such as the level-wise version
space algorithm of [8] and that of Boulicaut [5]. Notice that it is not always
possible to do this. Consider e.g. a query of the form $a \vee m$ where both a and m
are atomic. A natural question that arises in this context is how many version
spaces are, in the worst case*, needed to represent the solution space to an
inductive query. This problem is tackled in a recent paper by De Raedt et al.
[9]. More formally, De Raedt et al. study the problem of reformulating a boolean
inductive query q into a logically equivalent query of the form $q_1 \vee \ldots \vee q_k$ where
each $q_i = a_i \wedge m_i$ is in version space normal form (i.e. $a_i$ is an anti-monotonic
query and $m_i$ a monotonic one) and such that k is minimal. The number k is
called the dimension of the query q. This is useful for query optimization because
using the reformulated query in the algorithm specified in Figure 3 would result
in the minimum number k of calls to an algorithm for computing a version space.
As an example, consider the query $q_1 = (a_1 \vee a_2) \wedge (m_1 \vee m_2)$ and its logically
equivalent formulation $q_2 = (a_1 \wedge m_1) \vee (a_1 \wedge m_2) \vee (a_2 \wedge m_1) \vee (a_2 \wedge m_2)$. $q_1$ is
in version space normal form, which shows that the dimension of both queries
is 1. Now, De Raedt et al. provide a procedure for determining the dimension
of a query q and for rewriting it into subqueries $q_i$ that are in version space
normal form. This procedure could be used to rewrite the query $q_2$ into $q_1$.
More specifically, they show how to obtain the $a_i$ and $m_i$ where each $a_i$
is of the form $(a_{1,1} \wedge \ldots \wedge a_{1,n_1}) \vee \ldots \vee (a_{k,1} \wedge \ldots \wedge a_{k,n_k})$ and each $m_i$ of the
form $(m_{1,1} \wedge \ldots \wedge m_{1,n_1}) \vee \ldots \vee (m_{k,1} \wedge \ldots \wedge m_{k,n_k})$. Observe that each $a_i$ is
anti-monotonic because all the conjunctions in $a_i$ are (a conjunction of
anti-monotonic atoms is anti-monotonic) and a disjunction of anti-monotonic queries
is also anti-monotonic. Similarly, each $m_i$ is monotonic. A further contribution by
De Raedt et al. is that they show that the dimension of a query $q(\tau)$ corresponds
to the length of the longest alternating chain in its solution space. More formally,
an alternating chain of length k is a sequence $p_1 \preceq \ldots \preceq p_{2k-1}$ such that for all
odd i, $p_i \in q(\tau)$ and for all even i, $p_i \notin q(\tau)$.

* The best case corresponds to the situation where all atoms evaluate to the empty set.
In this uninteresting case, the number of version spaces needed is 0. The worst case
arises when all queries (that are not logically equivalent) have a different solution set.
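
The alternating-chain characterization can be explored by brute force on a small lattice of item sets. The sketch below assumes the subset order and the reading of the definition in which odd positions of the chain are solutions and even positions are not; the function name `dimension` and the example solution sets are hypothetical.

```python
from itertools import combinations

def powerset(items):
    """All subsets of `items` as frozensets."""
    items = sorted(items)
    return {frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)}

def dimension(solutions, language):
    """Largest k for which a chain p1 < p2 < ... < p(2k-1) exists whose odd
    elements are solutions and whose even elements are not."""
    best = {}   # best[p]: longest chain (measured in k) ending in solution p
    for p in sorted(language, key=len):      # smaller sets are processed first
        if p not in solutions:
            continue
        best[p] = 1
        for out in language:
            if out < p and out not in solutions:
                for r in solutions:
                    if r < out:              # r is smaller, so best[r] exists
                        best[p] = max(best[p], best[r] + 1)
    return max(best.values(), default=0)

fs = frozenset
language = powerset("abc")
# One version space: all item sets containing 'a'.
single = {t for t in language if "a" in t}
# Hypothetical disjunction a OR m: subsets of {a}, or supersets of {a,b,c}.
disjunction = {fs(), fs("a"), fs("abc")}
```

For `disjunction` the chain {} < {b} < {a,b,c} alternates between solution, non-solution and solution, so two version spaces are needed; an empty solution set yields 0, in line with the best case mentioned in the footnote.
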

In the context of query optimization, two other points are worth mentioning. 
These concern the use of background knowledge and subsumption. 

6.2 Subsumption 

We previously explained that two logically equivalent queries have the same set 
of solutions. Often it is useful to also consider subsumption among different 
queries. 

Lemma 7. Let q and q' be two boolean queries. If q logically entails q', then
$Th(q, \mathcal{L}, \mathcal{D}) \subseteq Th(q', \mathcal{L}, \mathcal{D})$. We say that q' subsumes q.

This property can be used in a variety of contexts, cf. [2]. First, during one 
query answering session there may be several related queries. The user of the 
inductive database might first formulate the query q' and later refine it to q. 
If the results of the previous queries (or query parts) are stored, one could use 
them as the starting point for answering the query. Secondly, one might use it 
during the query optimization process. E.g., suppose one derives a query $q \vee q'$ such
that $q \models q'$. One could then simplify the query into q'.

As opposed to this logical or intensional subsumption test, it is also possible
to perform an extensional subsumption test among version spaces.

Definition 11. A version space $(S_1, G_1)$ extensionally subsumes $(S_2, G_2)$ if and
only if $S_2 \cup G_2 \subseteq (S_1, G_1)$.

Whenever q is subsumed by q', (S(q),G(q)) will be extensionally subsumed 
by (S(q'),G(q')) though the reverse property does not always hold. 
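
The extensional test of Definition 11 only inspects border elements, which makes it cheap to implement. The following sketch assumes item sets ordered by the subset relation; the function names and the two version spaces are illustrative.

```python
def in_version_space(t, S, G):
    """t belongs to (S, G) if it lies between some g in G and some s in S."""
    return any(g <= t for g in G) and any(t <= s for s in S)

def extensionally_subsumes(vs1, vs2):
    """Definition 11: all border elements of vs2 belong to vs1."""
    S1, G1 = vs1
    S2, G2 = vs2
    return all(in_version_space(t, S1, G1) for t in S2 | G2)

fs = frozenset
vs_large = ({fs("abc")}, {fs()})       # hypothetical: all subsets of {a,b,c}
vs_small = ({fs("ab")}, {fs("a")})     # hypothetical: {a} and {a,b}
```

Here `vs_large` extensionally subsumes `vs_small` but not the other way round. Since version spaces are convex, checking the borders suffices to guarantee that every element of the subsumed space is covered.
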

6.3 Background Knowledge 

For many query languages it is useful to also formulate background knowl- 
edge. Such background knowledge then contains properties about the primitives 
sketched. We illustrate the idea on frequency. Let freq(p, d) denote the frequency 
of pattern p on data set d. Atoms would typically impose a minimum (or
maximum) frequency on a given data set. We can then incorporate the following
logical sentences in our background theory KB.

$\forall p, d, c_1, c_2 : (freq(p, d) \geq c_1) \wedge (c_1 \geq c_2) \rightarrow freq(p, d) \geq c_2$

$\forall p, d, c_1, c_2 : (freq(p, d) \leq c_2) \wedge (c_1 \geq c_2) \rightarrow freq(p, d) \leq c_1$

Such background knowledge can be used to reason about queries. This
motivates the following property.

Property 3. Let q and q' be two boolean queries and let KB be the background
theory about the constraint primitives. If $KB \models q \rightarrow q'$, then
$Th(q, \mathcal{L}, \mathcal{D}) \subseteq Th(q', \mathcal{L}, \mathcal{D})$.

In the light of the formulated background theory the first lemma would allow
us to conclude that $freq(\tau, d) \geq 3$ subsumes $freq(\tau, d) \geq 5$.
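
The two background sentences translate into a simple entailment check between frequency atoms. The sketch below is restricted to atoms on the same data set with the same comparison direction; the `FreqAtom` type and the `entails` function are hypothetical names, and the check is sound for these two sentences only, not a general theorem prover.

```python
from typing import NamedTuple

class FreqAtom(NamedTuple):
    dataset: str      # name of the data set d
    op: str           # ">=" (minimum frequency) or "<=" (maximum frequency)
    threshold: int

def entails(q, q_prime):
    """True if KB |= q -> q' follows from the two background sentences."""
    if q.dataset != q_prime.dataset or q.op != q_prime.op:
        return False                     # the sentences do not apply
    if q.op == ">=":
        # freq >= c1 and c1 >= c2 imply freq >= c2
        return q.threshold >= q_prime.threshold
    # freq <= c2 and c1 >= c2 imply freq <= c1
    return q.threshold <= q_prime.threshold

at_least_5 = FreqAtom("d", ">=", 5)
at_least_3 = FreqAtom("d", ">=", 3)
```

Since `at_least_5` entails `at_least_3`, Property 3 gives that every solution of the stricter query is also a solution of the weaker one, so stored results for the weaker query can serve as a starting point.
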

7 Conclusion

We have introduced a novel and expressive framework for inductive databases 
in which queries correspond to boolean expressions. To the best of the author’s 
knowledge this is the most general formulation of inductive querying so far, cf. 
also [12,9]. Secondly, we have shown that this formulation is interesting in that 
it allows us to represent the solution space of queries using border sets and also 
that it provides us with a logical framework for reasoning about queries and their 
execution. This provides hope that the framework could be useful as a theory of 
inductive databases. 

The results in the paper are related to the topic of version spaces, which 
have been studied for many years in the field of machine learning, cf. e.g. 
[24,25,22,26,29,8]. The novelty here is that the framework is applied to that 
of inductive querying of boolean expressions and that this requires the use of 
disjunctive version spaces (cf. also [26,30]). 

The work is also related to constraint based data mining, cf. e.g. [15,18]. 
However, as argued before, the form of query considered here is more general. 
The use of border sets for inductive querying has recently also been considered
by [11].

Abstract. The paper surveys basic principles and foundations of the
GUHA method, relation to some well-known data mining systems, main 
publications, existing implementations and future plans. 



1 Introduction: Basic Principles 

GUHA (General Unary Hypotheses Automaton) is a method that originated in Prague
(at the Czechoslovak Academy of Sciences) in the mid-sixties. Its main principle is to let
the computer generate and evaluate all hypotheses that may be interesting from
the point of view of the given data and the studied problem. This principle has
led both to a specific theory and to several software implementations. Whereas
the latter quickly became obsolete, the theory elaborated in the meantime has
lasting value. Typically hypotheses have the form “Many A’s are B’s” (B
is highly frequent in A) or “A, B are mutually positively dependent”. (Note
that what is now called “association rules” in data mining occurs already in the
first 1966 paper [14] on GUHA, see below.) A second feature, very important for
GUHA, is its explicit logical and statistical foundations.

Logical foundations include observational calculi (a kind of predicate calculi 
with only finite models and with generalized quantifiers, serving to express re- 
lations among the attributes valid in data) and theoretical calculi (a kind of 
modal predicate calculi serving to express probabilistic or other dependencies 
among the attributes, meaningful in the universe of discourse). Statistical foun- 
dations include principles of statistical hypotheses testing and other topics of 
exploratory data analysis. Statistical hypothesis testing is described as a sort 
of inference in logical sense. But note that GUHA is not bound to generation of 
statistical hypotheses; the logical theory of observational calculi is just logic of 
patterns (associations, dependencies etc.) contained (true) in the data. 

The monograph [15] contains detailed exposition of fundamentals of this the- 
ory. The underlying logical calculi are analyzed and several basic facts for cor- 
responding algorithms are proved. Special attention is paid to deduction rules 
serving for optimization of knowledge representation and of intelligent search*.



* This book [15] has not been obtainable for several years. We are happy to
announce that its publisher, Springer-Verlag, reverted the copyright to the authors,
which has made it possible to put the text of the book on the web for free copying as a
report of the Institute of Computer Science [16].



R. Meo et al. (Eds.): Database Support for Data Mining Applications, LNAI 2682, pp. 135-153, 2004.
© Springer-Verlag Berlin Heidelberg 2004

We hope the reader realizes that the aim of this paper is not merely to stress
the antiquity of GUHA; on the contrary, we want to show the reader that GUHA
(even if old) presents an approach to data mining (knowledge discovery) that is 
valuable and useful now. We summarize and survey here the underlying notions 
and theory (Sect. 2 and 3), one contemporary implementation and its perfor- 
mance evaluation (Sect. 4), as well as an example of particular application to 
financial data (Sect. 5); furthermore, we comment on the relation of GUHA to 
current data mining and data base systems (Sect. 6 and 7). We conclude with 
some open problems. 



2 Hypotheses Alias Rules 

For simplicity imagine the data processed by GUHA as a rectangular matrix
of zeros and ones, the rows corresponding to objects and columns to some
attributes. (Needless to say, much more general data can be processed.) In
the terminology of [1], columns correspond to items and rows describe itemsets
corresponding to transactions. In logical terminology one works with
predicates $P_1, \ldots, P_n$ (names of attributes), negated predicates $\neg P_1, \ldots, \neg P_n$,
elementary conjunctions (e.g. $P_1 \& \neg P_3 \& P_7$) and possibly elementary disjunctions
($P_1 \vee \neg P_3 \vee P_7$).

Needless to say, an object satisfies the formula $P_1 \& \neg P_3 \& P_7$ if its row in the
data matrix has 1 in the fields no. 1 and 7 and has 0 in the field no. 3; etc.

A hypothesis (association rule, observational statement) has the form

$\varphi \sim \psi$

where $\varphi$, $\psi$ are elementary conjunctions and $\sim$ is the sign of association. Logically
speaking it is a quantifier; in the present context just understand the word
“quantifier” as synonymous with “a notion of association”. The formulas $\varphi$, $\psi$
determine four frequencies a, b, c, d (the number of objects in the data matrix
satisfying $\varphi \& \psi$, $\varphi \& \neg\psi$, $\neg\varphi \& \psi$, $\neg\varphi \& \neg\psi$ respectively). They are often presented
as a four-fold table:

              $\psi$    $\neg\psi$
  $\varphi$         a     b
  $\neg\varphi$     c     d



The semantics of $\sim$ is given by a function $tr_\sim$ assigning to each four-fold
table a, b, c, d the number 1 (true - the formulas with this table are associated)
or 0 (false - not associated). Clearly there may be many such functions, thus
many quantifiers. The intuitive notion of (positive) association is: the numbers
a, d (counting objects for which both $\varphi$, $\psi$ are true or both are false -
coincidences) somehow dominate the numbers b, c (differences). Precisely this leads
to the following natural condition on the truth function $tr_\sim$ of $\sim$: if $(a, b, c, d)$
and $(a', b', c', d')$ are two four-fold tables and $a' \geq a$, $b' \leq b$, $c' \leq c$, $d' \geq d$
(coincidences increased, differences diminished) then $tr_\sim(a, b, c, d) = 1$ implies
$tr_\sim(a', b', c', d') = 1$. Such quantifiers are called associational (see [14]).

Let us briefly discuss three classes of associational quantifiers.

(1) Implicational. Example: The quantifier $\Rightarrow_{p,s}$ of founded implication: hypothesis
true if $a/(a + b) \geq p$ and $a \geq s$ (see [14]). This is almost the semantics
of Agrawal [1] (only instead of giving a lower bound for the absolute frequency
a he gives a lower bound minsup for the relative frequency a/m where m is
the number of transactions; this is indeed a very unessential difference.)

This quantifier does not depend on c, d (the second line of the four-fold
table). Thus it satisfies:

if $a' \geq a$, $b' \leq b$ and $tr(a, b, c, d) = 1$ then $tr(a', b', c', d') = 1$.

Quantifiers with this property are called implicational. They are very useful;
$\varphi \Rightarrow^* \psi$ expresses the intuitive property that many objects satisfying $\varphi$
satisfy $\psi$. For statistically motivated implicational quantifiers see [14]. But
they have one weakness: they say nothing on how many objects satisfying
$\neg\varphi$ satisfy $\psi$: also many? or few? Both things can happen. The former case
means that in the whole data many objects have $\psi$. Then $\varphi \Rightarrow^* \psi$ does not
say much. This leads to our second class:

(2) Comparative. Example: The quantifier $\sim$ of simple deviation: hypothesis true
if $ad > bc$ (equivalently, if $a/(a + b) > c/(c + d)$; in words, $\psi$ is more frequent among
objects satisfying $\varphi$ than among those satisfying $\neg\varphi$).

Note that this does not say that many objects having $\varphi$ have $\psi$; e.g.
among objects having $\neg\varphi$, 10% have $\psi$, but among those having $\varphi$, 30%
have $\psi$. (Trivial example: smoking increases the occurrence of cancer.) An
associational quantifier is comparative if $tr_\sim(a, b, c, d) = 1$ implies $ad > bc$.
We may make the simple quantifier above dependent on a parameter, and
define $tr_\sim(a, b, c, d) = 1$ if $ad > K \cdot bc$ (K a constant) etc. For statistically
motivated comparative quantifiers (Fisher, chi-square) see again [14].

(3) Combined. We may combine these quantifiers in various ways, getting new
quantifiers. Here are some examples. Let $\Rightarrow^*$ be an implicational quantifier
and $\sim$ a comparative associational quantifier.

Define $\varphi \Leftrightarrow^* \psi$ iff $\varphi \Rightarrow^* \psi$ and $\psi \Rightarrow^* \varphi$. Thus

$tr_{\Leftrightarrow^*}(a, b, c, d) = 1$ iff $tr_{\Rightarrow^*}(a, b, c, d) = 1$ and $tr_{\Rightarrow^*}(a, c, b, d) = 1$.

This quantifier depends only on a, b, c (and it is not comparative). It is a
pure doubly implicational quantifier [36]. Pure doubly implicational quantifiers
are a special case of doubly implicational quantifiers [37].

Define $\varphi \equiv^* \psi$ iff $\varphi \Rightarrow^* \psi$ and $\neg\varphi \Rightarrow^* \neg\psi$. Then

$tr_{\equiv^*}(a, b, c, d) = 1$ iff $tr_{\Rightarrow^*}(a, b, c, d) = 1$ and $tr_{\Rightarrow^*}(d, c, b, a) = 1$.

This is a pure equivalence quantifier [36].

Exercise: Show for $\Rightarrow_{p,s}$ that if $p > 0.5$ then the corresponding quantifier
$\equiv^*$ is comparative.
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Last example: define (/? tp if ip =J>* ip and tp ^ ip. Thus 

(a, b, c,d) = l iff (a, b, c, d) = tr.^(a, b, c, d) = 1. 

Now p^*ip says: many objects having p have ip and not so many objects 
having have ip. 

Various implementations of GUHA work with various choices of such quantifiers,
either directly (generating true sentences e.g. with $\Rightarrow^*_\sim$) or indirectly: during
the interpretation of results with $\Rightarrow^*$, say, one can sort out those sentences
$\varphi \Rightarrow^* \psi$ satisfying also $\varphi \sim \psi$.
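
The combinations described in (3) can be sketched in code by treating a quantifier as a truth function on the four-fold table (a, b, c, d). This is only an illustration, not the code of any GUHA implementation: the function names are hypothetical, and the threshold quantifier used as the implicational building block (a >= base and a/(a+b) >= p) is an assumption made for the example.

```python
from typing import Callable

# A quantifier is modelled by its truth function on the four-fold table (a, b, c, d).
Quantifier = Callable[[int, int, int, int], bool]


def threshold_implication(p: float, base: int = 1) -> Quantifier:
    """Hypothetical implicational quantifier: a >= base and a/(a+b) >= p."""
    def tr(a: int, b: int, c: int, d: int) -> bool:
        return a >= base and a >= p * (a + b)
    return tr


def simple_deviation(a: int, b: int, c: int, d: int) -> bool:
    """Comparative quantifier of simple deviation: true iff ad > bc."""
    return a * d > b * c


def double_implication(impl: Quantifier) -> Quantifier:
    """impl must hold for (phi, psi) and for (psi, phi).

    Swapping phi and psi swaps b and c in the four-fold table.
    """
    return lambda a, b, c, d: impl(a, b, c, d) and impl(a, c, b, d)


def equivalence(impl: Quantifier) -> Quantifier:
    """impl must hold for (phi, psi) and for (not phi, not psi).

    Negating both formulas turns the table (a, b, c, d) into (d, c, b, a).
    """
    return lambda a, b, c, d: impl(a, b, c, d) and impl(d, c, b, a)


def conjunction(first: Quantifier, second: Quantifier) -> Quantifier:
    """Both quantifiers must hold on the same four-fold table."""
    return lambda a, b, c, d: first(a, b, c, d) and second(a, b, c, d)


impl = threshold_implication(p=0.9)
table = (95, 5, 2, 50)
print(double_implication(impl)(*table))
print(equivalence(impl)(*table))
print(conjunction(impl, simple_deviation)(*table))
```

The table (10, 0, 0, 0) shows why the double implication built this way is not comparative: it is accepted although ad > bc fails.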



3 More on the Underlying Theory 

Here we survey some notions and aspects of theoretical foundations of GUHA, 
mentioned in the introduction. The section may be skipped at first reading. 

Observational logical calculi. The symbols used are: unary predicates
$P_1, \ldots, P_n$, a unique object variable $x$ (that may be omitted in all occurrences),
logical connectives (say, $\&, \vee, \rightarrow, \neg$: conjunction, disjunction, implication,
negation), one or more quantifiers (and brackets). Atomic formulas have the form
$P_i(x)$ (or just $P_i$); open formulas result from atomic ones using connectives (e.g.
$(P_1 \& P_3) \vee \neg P_7$ etc.). If $P_1, \ldots, P_n$ name the columns of a given data matrix it
should be clear what it means that an object satisfies an open formula in the
data matrix.

Each quantifier has dimension 1 (is applied to one open formula) or 2 (applied
to two formulas). (Higher dimensions are possible.) Classical quantifiers $\forall, \exists$ (for all,
there is) have dimension 1. If $\varphi$ is an open formula then $(\forall x)\varphi$ and $(\exists x)\varphi$ are
formulas (in brief, $\forall\varphi, \exists\varphi$). Other examples: $(Many\ x)\varphi$, $(Based\ x)\varphi$. Let $a$ be
the number of objects satisfying $\varphi$, $b$ the number of objects satisfying $\neg\varphi$. The
semantics of a quantifier $q$ is given by a truth function $Tr_q(a,b) \in \{0, 1\}$. For
example, $Tr_\forall(a,b) = 1$ iff $b = 0$; $Tr_\exists(a,b) = 1$ iff $a > 0$. Given $0 < p \le 1$,
$Tr_{Many}(a,b) = 1$ iff $a/(a+b) \ge p$; given a natural number $\tau > 0$, $Tr_{Based}(a,b) =
1$ iff $a \ge \tau$. Examples of quantifiers of dimension 2 and their truth functions were
discussed above. A formula $(qx)\varphi$ is true in given data if $Tr_q(a,b) = 1$ where
$a, b$ are the frequencies as above. Formulas of the form $(qx)\varphi$ ($q$ a one-dimensional
quantifier) or $(\sim x)(\varphi,\psi)$ (written also $\varphi \sim \psi$, a two-dimensional quantifier) are
called prenex closed formulas. Other closed formulas result from prenex ones
using connectives, e.g. $\forall\varphi \rightarrow Many(\varphi)$.
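
As a small illustration of these truth functions, the following sketch evaluates the four one-dimensional quantifiers on a single Boolean column. The function names are hypothetical, and the thresholds are read non-strictly (at least p, at least tau), which is an assumption about the intended definitions.

```python
def frequencies(column):
    """Return (a, b): numbers of objects satisfying / not satisfying the formula."""
    a = sum(1 for value in column if value)
    return a, len(column) - a


def tr_forall(a, b):
    """Classical 'for all': no object violates the formula."""
    return b == 0


def tr_exists(a, b):
    """Classical 'there is': at least one object satisfies the formula."""
    return a > 0


def tr_many(a, b, p):
    """'Many' with threshold p: relative frequency a/(a+b) is at least p."""
    return a + b > 0 and a >= p * (a + b)


def tr_based(a, b, tau):
    """'Based' with threshold tau: at least tau objects satisfy the formula."""
    return a >= tau


# Truth values of an open formula on ten objects (illustrative data).
column = [True] * 8 + [False] * 2
a, b = frequencies(column)
print(tr_forall(a, b), tr_exists(a, b), tr_many(a, b, 0.75), tr_based(a, b, 8))
```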

A deduction rule has some assumptions and a conclusion. It is sound if for 
each data set D, whenever the assumptions are true in D, then the conclusion 
is true in D. 

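Examples: Let $\varphi, \psi, \chi$ be open formulas.

(1) For each associational quantifier $\sim$, the following is a sound rule:

$\frac{\varphi \sim \psi}{\varphi \sim (\varphi \& \psi)}$

(2) For each implicational quantifier $\Rightarrow^*$, the following are sound rules:

$\frac{\varphi \Rightarrow^* \psi}{\varphi \Rightarrow^* (\psi \vee \chi)}$, $\frac{(\varphi \& \chi) \Rightarrow^* \psi}{\varphi \Rightarrow^* (\psi \vee \neg\chi)}$

(3) A quantifier $\sim$ is $\sigma$-based ($\sigma$ natural) if $\frac{\varphi \sim \psi}{Based_\sigma(\varphi \& \psi)}$ is a sound rule, i.e.
whenever $\varphi \sim \psi$ is true in $D$ then the frequency $a = fr(\varphi \& \psi)$ satisfies
$a \ge \sigma$. Then the following rules are sound:

$\frac{\neg Based_\sigma(\varphi \& \psi)}{\neg(\varphi \sim \psi)}$, $\frac{\neg Based_\sigma(\varphi \& \psi)}{\neg Based_\sigma(\varphi \& \psi \& \chi)}$

This is very useful for pruning the tree of all hypotheses $\varphi \sim \psi$: if we
meet $\varphi, \psi$ such that for them $a < \sigma$ then all hypotheses $(\varphi \& \chi) \sim \psi$ and
$\varphi \sim (\psi \& \chi)$ may be deleted from consideration.

(4) Some quantifiers (e.g. our SIMPLE, FISHER, CHISQ) make the following
rules (of commutativity and negation) sound:

$\frac{\varphi \sim \psi}{\psi \sim \varphi}$, $\frac{\varphi \sim \psi}{\neg\varphi \sim \neg\psi}$

This is also useful for the generation of hypotheses.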

The general question whether a deduction rule of the form $\frac{\varphi \sim \psi}{\varphi' \sim \psi'}$ is sound was also
studied [33], [37]. It was shown that there is a relatively simple condition equivalent
to the fact that the deduction rule is sound. The condition depends on the
class of quantifiers the quantifier $\sim$ belongs to. (There are classes of implicational,
$\Sigma$-double implicational and $\Sigma$-equivalency quantifiers.) This condition concerns
several propositional formulas of the form $\Phi(\varphi, \psi, \varphi', \psi')$ derived from $\varphi$, $\psi$, $\varphi'$
and $\psi'$. It is crucial whether these formulas are tautologies of propositional calculus or
not.

Observe that instead of saying that a rule, $\frac{\Phi}{\Psi}$ say, is sound ($\Phi, \Psi$ closed
formulas) we may say that the formula $\Phi \rightarrow \Psi$ (where $\rightarrow$ is the connective of
implication) is a tautology, i.e. is true in all data sets. Note that the computational
complexity of associational and implicational tautologies (i.e. formulas being
tautologies for each associational/implicational quantifier) has been studied in
[11] and [9].

To close this section, let us briefly comment on our theoretical logical calculi.
They serve as statistical foundations for the case that observational hypotheses
use quantifiers defined by some test statistics in statistical hypothesis testing. In
this case one has, besides the observational language, its theoretical counterpart,
enabling one to express hypotheses on probabilities of open formulas in the unknown
universe, from which our data form a sample, for example $P(\psi|\varphi) > P(\psi)$ (the
conditional probability of $\psi$ given $\varphi$ is bigger than the unconditional probability
of $\psi$). Let $THyp(\varphi,\psi)$ be such a theoretical hypothesis and let $\varphi \sim \psi$ be an
observational hypothesis. This formula is a test for $THyp(\varphi,\psi)$ if under some
“frame assumption” $Frame$, one can prove that if $THyp(\varphi,\psi)$ were false then
the probability that $\varphi \sim \psi$ is true in a data sample would be very small:

$\frac{Frame, \neg THyp(\varphi,\psi)}{P(\varphi \sim \psi \text{ true in data}) \le \alpha}$

where $\alpha$ is a significance level. (This is how a logician understands what
statisticians are doing.)

Note that the above concerns just one pair $THyp(\varphi,\psi)$, $\varphi \sim \psi$. What can one
say if we find several hypotheses $\varphi \sim \psi$ true in the data? This is the problem of
global interpretation of GUHA results, discussed also in [15].

A generalization of the above for testing fuzzy hypotheses is elaborated in
[26], [27]. Further theoretical results, concerning namely new deduction rules,
fast verification of statistically motivated hypotheses and dealing with missing
information, were achieved in [33] and later in [36]. Some of these results are
published in [35], [37] and [38]. These results are applied in the GUHA procedure
4ft-Miner.



4 The GUHA Procedure 4ft-Miner 

4.1 4ft-Miner Overview 

The GUHA method is realised by GUHA-procedures. A GUHA-procedure is a 
computer program, the input of which consists of the analysed data and of a few 
parameters defining a possibly very large set of potentially interesting hypothe- 
ses. The GUHA procedure automatically generates particular hypotheses from 
the given set and tests if they are supported by analysed data. This is done in 
an optimized way using several techniques of eliminating as many hypotheses as 
possible from the verification since their truth/falsity follows from the facts ob- 
tained in the given moment of verification. The output of the procedure consists 
of all prime hypotheses. The hypothesis is prime if it is true in (supported by) 
the given data and if it does not follow from the other output hypotheses. 

Several GUHA procedures have been implemented since 1966, see Section 7. One of
the current implementations is the GUHA procedure 4ft-Miner. It is a part of the
academic software system LISp-Miner. The purpose of this system is to support
teaching and research in the field of KDD (knowledge discovery in databases).
The system consists of several procedures for data mining, machine learning and
for data exploration and transformations. It is developed by a group of teachers
and students of the University of Economics, Prague, see [44]. The whole
LISp-Miner system can be freely downloaded, see [47].

The GUHA procedure 4ft-Miner deals with data matrices with nominal values.
An example of such a data matrix $\mathcal{M}$ is in Fig. 1.

Rows of data matrix $\mathcal{M}$ correspond to observed objects, columns of the data
matrix correspond to attributes $V_1, \ldots, V_K$. We suppose that $v_{1,1}$ is the value
of attribute $V_1$ for the first object, $v_{n,K}$ is the value of attribute $V_K$ for the last
($n$-th) object, etc.

  V_1        V_2        ...   V_K
  v_{1,1}    v_{1,2}    ...   v_{1,K}
  ...        ...        ...   ...
  v_{n,1}    v_{n,2}    ...   v_{n,K}

Fig. 1. Data matrix $\mathcal{M}$



The 4ft-Miner mines for hypotheses of the form $\varphi \approx \psi$ and for the conditional
hypotheses $\varphi \approx \psi / \chi$. Here $\varphi$, $\psi$ and $\chi$ are conjunctions of literals. A literal can
be positive or negative. A positive literal is an expression of the form $V(\alpha)$ where
$V$ is an attribute and $\alpha$ is a proper subset of the set of all categories (i.e. possible
values) of $V$. A negative literal is the negation of a positive literal. The set $\alpha$ is
the coefficient of the literal $V(\alpha)$.

The literal $V(\alpha)$ is true in a row of the given data matrix if and only if
the value $v$ in this row and in the column corresponding to the attribute $V$
is an element of $\alpha$. Let {A,B,C,D,E,F} be the set of all possible values of the
attribute $V$. Then V(A) and V(A,B,F) are examples of literals derived from the
attribute $V$. Here {A} is the coefficient of V(A) and {A,B,F} is the coefficient
of V(A,B,F).

A conditional hypothesis $\varphi \approx \psi / \chi$ is true in the data matrix $\mathcal{M}$ iff the
hypothesis $\varphi \approx \psi$ is true in the data matrix $\mathcal{M}/\chi$. The data matrix $\mathcal{M}/\chi$
consists of all rows of the data matrix $\mathcal{M}$ satisfying the Boolean attribute $\chi$ (we
suppose there is such a row).
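
The semantics of literals and of conditional hypotheses can be made concrete with a short sketch. It is not taken from 4ft-Miner; rows are modelled as dictionaries, the helper names are hypothetical, and the tiny data matrix is invented for the illustration. The function returns the four-fold table (a, b, c, d) of an antecedent and a succedent, optionally restricted to the rows satisfying a condition.

```python
def literal(attribute, coefficient, positive=True):
    """Return a row predicate for the literal attribute(coefficient) or its negation."""
    coefficient = frozenset(coefficient)

    def holds(row):
        return (row[attribute] in coefficient) == positive

    return holds


def conjunction(*literals):
    """Conjunction of literals, as used for antecedent, succedent and condition."""
    return lambda row: all(lit(row) for lit in literals)


def four_fold_table(rows, antecedent, succedent, condition=None):
    """Count (a, b, c, d) over the rows satisfying the condition (all rows if None)."""
    a = b = c = d = 0
    for row in rows:
        if condition is not None and not condition(row):
            continue
        if antecedent(row):
            if succedent(row):
                a += 1
            else:
                b += 1
        elif succedent(row):
            c += 1
        else:
            d += 1
    return a, b, c, d


# Invented data matrix with attributes V, W, Z.
rows = [
    {"V": "A", "W": "x", "Z": "k"},
    {"V": "B", "W": "x", "Z": "k"},
    {"V": "C", "W": "x", "Z": "k"},
    {"V": "A", "W": "y", "Z": "k"},
    {"V": "F", "W": "y", "Z": "m"},
    {"V": "D", "W": "x", "Z": "m"},
]
phi = conjunction(literal("V", {"A", "B", "F"}))
psi = conjunction(literal("W", {"x"}))
chi = conjunction(literal("Z", {"k"}))

print(four_fold_table(rows, phi, psi))        # whole data matrix
print(four_fold_table(rows, phi, psi, chi))   # sub-matrix of rows satisfying chi
```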

The GUHA procedure 4ft-Miner automatically generates and verifies the set 
of hypotheses in the given data matrix. The set of hypotheses to be generated 
and tested is defined by 

• definition of all antecedents, 

• definition of all succedents, 

• definition of all conditions, 

• definition of the quantifier and its parameters (there are 17 types of various
quantifiers, see http://lispminer.vse.cz/overview/4ft_quantifier.html).

The antecedent is the conjunction of literals automatically generated from the 
given set of antecedent attributes. (There cannot be two literals created from 
the same attribute in one hypothesis.) The set of all antecedents is given by 

• a list of attributes - some of them are marked as basic (each antecedent must 
contain at least one basic attribute), 

• a minimal and maximal number of attributes to be used in antecedent, 

• a definition of the set of all literals to be generated from each attribute. 

The set of all literals to be generated for particular attribute is given by: 

• a type of coefficient (see below), 

• a minimal and maximal number of categories in the coefficient, 

• positive/negative literal option:

  - only positive literals will be generated,

  - only negative literals will be generated,

  - both positive and negative literals will be generated.

There are six types of coefficients: subsets, intervals, cyclic intervals, left cuts,
right cuts and cuts. 4ft-Miner generates all possible coefficients of the given type
with respect to the given minimal and maximal number of categories in the
coefficient. An example of a coefficient of the type subset is the coefficient
{A,B,F} of the literal V(A,B,F).

Further coefficients can be defined for attributes with ordinal values only.
Let us suppose that the attribute O has the categories 1, 2, . . . , 8, 9. Then the
coefficient of the literal O(4, 5, 6, 7, 8) is of the type interval: it corresponds to the
interval of integer numbers ⟨4, 8⟩. The coefficient of the literal O(4, 6, 9) is not of
the type interval. A left cut is an interval that begins at the first category of the
attribute. The coefficient of the literal O(1, 2, 3) is of the type left cut and the
coefficient of the literal O(4, 5, 6) is not of the type left cut. Analogously, a right
cut is an interval that ends at the last category of the attribute. A cut is either
a left cut or a right cut.
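
The generation of coefficients of a given type can be sketched as follows. This is an illustrative reading of the definitions above, not the 4ft-Miner implementation; the function names are hypothetical, categories are assumed to be given as a list in their natural order, and the length is capped at one less than the number of categories so that every coefficient is a proper subset.

```python
from itertools import combinations


def _lengths(n, min_len, max_len):
    # A coefficient is a proper subset, so its length is at most n - 1.
    return range(max(min_len, 1), min(max_len, n - 1) + 1)


def subsets(categories, min_len, max_len):
    """All subsets with the requested number of categories."""
    for k in _lengths(len(categories), min_len, max_len):
        yield from combinations(categories, k)


def intervals(categories, min_len, max_len):
    """All runs of consecutive categories (ordinal attributes only)."""
    n = len(categories)
    for k in _lengths(n, min_len, max_len):
        for start in range(n - k + 1):
            yield tuple(categories[start:start + k])


def left_cuts(categories, min_len, max_len):
    """Intervals that begin at the first category."""
    for k in _lengths(len(categories), min_len, max_len):
        yield tuple(categories[:k])


def right_cuts(categories, min_len, max_len):
    """Intervals that end at the last category."""
    n = len(categories)
    for k in _lengths(n, min_len, max_len):
        yield tuple(categories[n - k:])


def cuts(categories, min_len, max_len):
    """Left cuts followed by right cuts."""
    yield from left_cuts(categories, min_len, max_len)
    yield from right_cuts(categories, min_len, max_len)


ordinal = list(range(1, 10))   # categories 1, ..., 9 of the attribute O
nominal = list("ABCDEF")       # categories of the attribute V
print(list(left_cuts(ordinal, 3, 3)))
print(list(cuts(ordinal, 1, 2)))
```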

The output of the procedure 4ft-Miner consists of all prime hypotheses. There 
are also various possibilities for sorting and filtering of the output hypotheses. 

The GUHA procedure 4ft-Miner has been applied several times, e.g. in medicine
(see e.g. [4]), sociology and traffic. The modular architecture of the LISp-Miner
system makes it possible to create a “tailor-made” interface customized for
specialists in various fields. Only notions familiar to the field specialists are
used in such an interface. An experiment with on-line data mining can be found at
http://euromise.vse.cz/stulong-en/online/. A very simple interface to the
procedure 4ft-Miner is used there. Let us remark that there are some activities related
to the conversion of hypotheses produced by 4ft-Miner into natural language
[45].

4.2 A Performance Analysis 

The usual A-priori algorithm [1] is not applicable for 4ft-Miner. Instead, it uses the
fast-mining-bit-string approach based on representation of analysed data by
suitable strings of bits. The core of this approach is the representation of each possible
value (i.e. category) of each attribute by a string of bits [31]. This representation
makes it possible to use simple data structures to compute bit-string
representations of each antecedent and of each succedent. Bit-string representations of
antecedent and succedent are used to compute the necessary four-fold tables in a
very fast way [42]. The running time of the resulting algorithm is linearly dependent
on the number of rows of the analysed data matrix. Let us note that this approach makes
it possible to effectively mine for conditional association rules and for rules with literals
with coefficients with more than one value, e.g. V(A,B,F), see above.
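
The idea of the bit-string approach can be illustrated with a minimal sketch. It is not the 4ft-Miner code: Python integers stand in for the strings of bits (bit i corresponds to row i), all function names are hypothetical, and the five-row data matrix is invented. A literal with several categories is the bitwise OR of the category strings, a conjunction is a bitwise AND, and the four frequencies are obtained by counting ones.

```python
def category_bitstrings(rows, attributes):
    """Map (attribute, category) to an int whose bit i is 1 iff row i has that category."""
    cards = {}
    for i, row in enumerate(rows):
        for attribute in attributes:
            key = (attribute, row[attribute])
            cards[key] = cards.get(key, 0) | (1 << i)
    return cards


def literal_bits(cards, attribute, coefficient):
    """Bit string of a positive literal: OR over the categories of its coefficient."""
    bits = 0
    for category in coefficient:
        bits |= cards.get((attribute, category), 0)
    return bits


def popcount(bits):
    """Number of ones in a non-negative integer."""
    return bin(bits).count("1")


def four_fold_table(phi_bits, psi_bits, n_rows, condition_bits=None):
    """Four-fold table (a, b, c, d), optionally restricted to rows of a condition."""
    if condition_bits is None:
        condition_bits = (1 << n_rows) - 1   # all rows
    a = popcount(phi_bits & psi_bits & condition_bits)
    b = popcount(phi_bits & ~psi_bits & condition_bits)
    c = popcount(~phi_bits & psi_bits & condition_bits)
    d = popcount(condition_bits) - a - b - c
    return a, b, c, d


# Invented data matrix.
rows = [
    {"Salary": "average", "District": "Olomouc", "Loan": "bad"},
    {"Salary": "average", "District": "Tabor", "Loan": "bad"},
    {"Salary": "high", "District": "Prague", "Loan": "good"},
    {"Salary": "average", "District": "Prague", "Loan": "good"},
    {"Salary": "average", "District": "Olomouc", "Loan": "good"},
]
cards = category_bitstrings(rows, ["Salary", "District", "Loan"])
phi = (literal_bits(cards, "Salary", {"average"})
       & literal_bits(cards, "District", {"Olomouc", "Tabor"}))
psi = literal_bits(cards, "Loan", {"bad"})
cond = literal_bits(cards, "District", {"Olomouc"})

print(four_fold_table(phi, psi, len(rows)))
print(four_fold_table(phi, psi, len(rows), cond))
```

Each table costs a fixed number of bitwise operations over strings whose length equals the number of rows, which is consistent with the linear dependence on the number of rows stated above.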

We describe the solution of the same task on three data matrices with the
same structure but with various numbers of rows to demonstrate linearity of
this algorithm. Our task was derived from a database of the fictitious bank
BARBORA (see http://lisp.vse.cz/pkdd99/). We start with data matrix
LoanDetail, see Fig. 2.

  Id     Age     Sex   Salary    District   Amount     Repayment   Loan
  1      41-50   M     high      Prague     20-50      1-2         good
  2      31-40   F     average   Pilsen     > 500      9-10        bad
  ...
  6181   41-50   F     high      Brod       250-500    2-3         good

Fig. 2. Data matrix LoanDetail

Each row of data matrix LoanDetail corresponds to a loan; there are 6181
loans. Columns Age, Sex, Salary and District describe clients, columns Amount,
Repayment and Loan correspond to loans. The first row describes a loan given to
a man aged 41-50. He has a high salary and he lives in Prague. Further,
he borrowed an amount of money in the interval 20-50 thousand Czech crowns,
he repays 1-2 thousand Czech crowns and the quality of his loan is good
from the point of view of the bank.

There are five possible values 21-30, . . . , 61-70 of the attribute Age, six
values < 20, 20-50, 50-100, 100-250, 250-500, > 500 of the attribute Amount
and ten possible values < 1, 1-2, . . . , 9-10 of the attribute Repayment.

We are interested in all pairs (segment of clients, type of loan) such that
the assertions “to be a member of the segment of clients” and “to have a bad loan of
a certain type” are in some sense equivalent. This can be expressed by a conditional
association rule

SEGMENT $\Leftrightarrow_{0.95,20}$ Loan(bad) / TYPE

where SEGMENT is a Boolean attribute defining a segment of clients, Loan(bad)
is a Boolean attribute that is true iff the corresponding loan is bad and TYPE is
a Boolean attribute defining a type of loan.

The above rule SEGMENT $\Leftrightarrow_{0.95,20}$ Loan(bad) / TYPE is true in the
data matrix LoanDetail if the rule SEGMENT $\Leftrightarrow_{0.95,20}$ Loan(bad) is true in
the data matrix LoanDetail/TYPE (i.e. in the data matrix consisting of all rows
of data matrix LoanDetail satisfying the Boolean attribute TYPE describing
the type of loan). The association rule SEGMENT $\Leftrightarrow_{0.95,20}$ Loan(bad) is true
in data matrix LoanDetail/TYPE if the condition $\frac{a}{a+b+c} \ge 0.95 \wedge a \ge 20$ is
satisfied, where $a, b, c$ are frequencies from the four-fold table of Boolean attributes
SEGMENT and Loan(bad) in data matrix LoanDetail/TYPE:





                   Loan(bad)   $\neg$Loan(bad)
  SEGMENT              a            b
  $\neg$SEGMENT        c            d



Let us remark that the rule SEGMENT $\Leftrightarrow_{0.95,20}$ Loan(bad) / TYPE says:
If we consider only loans of the type TYPE then there are at least 95 per
cent of loans satisfying both SEGMENT and Loan(bad) among loans satisfying
SEGMENT or Loan(bad).
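
Read this way, the quantifier amounts to a simple check on the frequencies of the four-fold table. The following sketch is only an illustration with a hypothetical function name; it assumes the non-strict reading of both thresholds (at least 95 per cent, at least 20 rows), and the tables passed to it are invented.

```python
def founded_double_implication(a, b, c, d, p=0.95, base=20):
    """True iff a/(a+b+c) >= p and a >= base; the frequency d plays no role."""
    relevant = a + b + c   # rows satisfying the antecedent or the succedent
    return a >= base and relevant > 0 and a >= p * relevant


print(founded_double_implication(40, 1, 0, 300))   # 40/41 is above 0.95
print(founded_double_implication(19, 0, 0, 300))   # too few supporting rows
print(founded_double_implication(40, 3, 3, 300))   # 40/46 is below 0.95
```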
The Boolean attribute SEGMENT is derived from columns Age, Sex, Salary
and District of data matrix LoanDetail. An example is a segment of all men
21-30 years old defined by SEGMENT = Sex(M) $\wedge$ Age(21-30). The Boolean
attribute TYPE is defined similarly by columns Amount and Repayment. An
example is the Boolean attribute Amount(< 20) $\wedge$ Repayment(1-2) defining a type
of loans: borrowed less than 20000 Czech crowns and repaid between 1000 and
2000 Czech crowns each month.

Our task can be solved by the 4ft-Miner procedure where all antecedents to
be automatically generated are specified as conjunctions of one to four literals
of the form Age(?), Sex(?), Salary(?) and either District(?) or District(?,?) (we
consider single districts or pairs of districts) where each ? can be replaced by one
possible value of the corresponding attribute. There are 6 literals of the form
Age(?), 2 literals of the form Sex(?), 3 literals of the form Salary(?) and 3003 =
$\binom{77}{1} + \binom{77}{2}$ literals of the form either District(?) or District(?,?).

If we use terminology of the 4ft-Miner input we can say that the minimal 
length of antecedent is 1, maximal length of antecedent is 4, positive literals - 
subsets will be generated for all attributes Age, Sex, Salary and District. The 
minimal number of categories in all coefficients will be 1, the maximal number 
of categories in the coefficients for Age, Sex and Salary will be 1, the maximal 
number of categories in the coefficients for District will be 2. This defines a set 
of more than 250 000 antecedents. 

Similarly we can ask 4ft-Miner to use the Boolean attribute Loan(bad) as
a succedent and further to automatically generate all conditions as conjunctions
of one or two literals of the form Amount(?) or Repayment(?). There are
76 such conditions. Together there are more than $19 \cdot 10^6$ rules of the form
SEGMENT $\Leftrightarrow_{0.95,20}$ Loan(bad) / TYPE defined this way.
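
The counts quoted above can be re-derived with a few lines of arithmetic. The sketch assumes that each attribute either contributes exactly one literal or is left out, that at least one literal must be present, and that there are 77 districts (the only number for which single districts plus pairs of districts give 3003 literals).

```python
from math import comb, prod

# Number of literals that can be generated from each antecedent attribute.
antecedent_literals = {
    "Age": 6,
    "Sex": 2,
    "Salary": 3,
    "District": comb(77, 1) + comb(77, 2),   # single districts and pairs
}

# Each attribute is either absent or contributes one literal; drop the empty conjunction.
antecedents = prod(1 + k for k in antecedent_literals.values()) - 1

# Conditions: one or two literals from Amount (6 values) and Repayment (10 values).
conditions = (1 + 6) * (1 + 10) - 1

rules = antecedents * conditions
print(antecedent_literals["District"], antecedents, conditions, rules)
```

Under these assumptions the sketch gives 252 335 antecedents, 76 conditions and 19 177 460 rules, in agreement with the figures in the text.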

The task of automatic generation and verification of this set of rules is
solved in 56 seconds (PC with Pentium II processor, 256 MB RAM). Let us
emphasise that only 80 549 rules actually had to be fully processed due to various
optimisations used in the 4ft-Miner procedure (pruning of the tree of hypotheses).
The result of the run of the 4ft-Miner consists of 10 true association rules.
An example is the rule

Salary(average) $\wedge$ District(Olomouc, Tabor) $\Leftrightarrow_{1.0,27}$ Loan(bad) /
/ Amount(20-50) $\wedge$ Repayment(2-3).

The four-fold table of this rule is

                                                        Loan(bad)   $\neg$Loan(bad)
  Salary(average) $\wedge$ District(Olomouc, Tabor)            27            0
  $\neg$(Salary(average) $\wedge$ District(Olomouc, Tabor))     0          144

It concerns the data (sub)matrix LoanDetail/(Amount(20-50) $\wedge$ Repayment(2-3))
with 171 rows.

We solved the same task on the data matrices LoanDetail_10 and LoanDetail_20
on the same computer. Data matrix LoanDetail_10 has ten times more
rows than data matrix LoanDetail. Each row of data matrix LoanDetail is used
ten times in data matrix LoanDetail_10. Analogously, each row of data matrix
LoanDetail is used twenty times in data matrix LoanDetail_20. Data matrix
LoanDetail_20 has twenty times more rows than data matrix LoanDetail.

We use the quantifier $\Leftrightarrow_{0.95,200}$ in the data matrix LoanDetail_10 instead of
the quantifier $\Leftrightarrow_{0.95,20}$ to ensure the same behaviour of the optimised algorithm.
Similarly we use the quantifier $\Leftrightarrow_{0.95,400}$ in the data matrix LoanDetail_20. The
solution times necessary to solve the same task on all data matrices are compared
in the following table.



  Data matrix              LoanDetail                     LoanDetail_10                   LoanDetail_20
  number of rows           6181                           61 810                          123 620
  time total in seconds    56                             513                             1066
  quantifier               $\Leftrightarrow_{0.95,20}$    $\Leftrightarrow_{0.95,200}$    $\Leftrightarrow_{0.95,400}$
  fully processed rules    80 549                         80 549                          80 549
  from them true           10                             10                              10



We can conclude that the solution time is approximately linearly dependent 
on the number of rows of the analysed data matrix. 
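
The claim can be checked directly from the reported figures by computing the time spent per row in each run; a roughly constant value indicates linear scaling. The snippet only re-uses the numbers of rows and the total times given above.

```python
# (data matrix, number of rows, total time in seconds) as reported above.
runs = [
    ("LoanDetail", 6181, 56),
    ("LoanDetail_10", 61810, 513),
    ("LoanDetail_20", 123620, 1066),
]

ms_per_row = {name: 1000 * seconds / rows for name, rows, seconds in runs}
for name, value in ms_per_row.items():
    print(f"{name}: {value:.2f} ms per row")
```

The three values are about 9.06, 8.30 and 8.62 milliseconds per row, i.e. they differ by less than 10 per cent while the number of rows grows twentyfold.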



5 An Example of Application of GUHA 

This section gives a short report on the GUHA approach to the discovery chal- 
lenge for a financial data set, which was a part of the 3rd European Conference 
on Principles and Practice of Knowledge Discovery in Databases (PKDD’99), 
held in Prague, September 15-18, 1999. 

The goal of the challenge was to characterize clients of the fictitious bank
BARBORA (see the preceding section), particularly those having problems with loan
payments, using data mining methods. Note that the run described here is
different from that described in the preceding section (and was done using the
GUHA+- implementation [46]).

The financial data were stored in eight tables of a relational database. Each
table consisted of several attributes characterizing different properties of an
account. Hence the object of investigation was an account, and the attributes were its
properties. An account had static and dynamic characteristics. Static
characteristics were given by tables account, client, disposition, permanent order, loan,
credit card and demographic data. Dynamic ones were given by table transaction.

5.1 Data Description and Preprocessing 

Within the preprocessing stage, new attributes were computed on the basis of the original
ones. The following list consists of attributes computed from static
characteristics of an account:

ACCOUNT_YEAR; ACCOUNT_FREQ; ACCOUNT_DIST%;
OWNER_SEX; OWNER_AGE; OWNER_DIST%; USER;
ORDER_INSURANCE; ORDER_HOUSEHOLD; ORDER_LEASING;
ORDER_LOAN; ORDER_SUM; ORDER_OTHER;
LOAN_STATUS; LOAN_AMOUNT; LOAN_DURATION;
LOAN_PAYMENT; LOAN_YEAR;
CARD_TYPE; CARD_YEAR.

For lack of space, a detailed explanation is given in the next section only for
the attributes that occur in the hypotheses found. In total, 48
attributes describing static characteristics of an account were computed. Note that
ACCOUNT_DIST% and OWNER_DIST% are each shortcuts for 15 attributes
related to properties of the district where an account is held or where an owner lives.

Dynamic characteristics of an account were given by table transaction. On the basis
of them these new attributes were computed:

AvgM_AMOUNT_SIGN; MinM_AMOUNT_SIGN; MaxM_AMOUNT_SIGN;
AvgM_VOLUME; MinM_VOLUME; MaxM_VOLUME;
AvgM_BALANCE; MinM_BALANCE; MaxM_BALANCE;
AvgM_WITHDRAWAL_CARD; AvgM_CREDIT_CASH;
AvgM_COLLECTION; AvgM_WITHDRAWAL_CASH;
AvgM_REMITTANCE; AvgM_INSURANCE; AvgM_STATEMENT;
AvgM_CREDITED_INTEREST; AvgM_SANCTION_INTEREST;
AvgM_HOUSEHOLD; AvgM_PENSION; AvgM_LOAN;
AvgM_TRANSACTION_#.

The attributes give the average, minimal or maximal monthly cash flow with respect
to different types of transactions. Attribute AvgM_TRANSACTION_# refers
to the average number of realized transactions per month. In total, 22
attributes describing dynamic behavior of an account were defined.

The total number of attributes characterizing an account used for the GUHA
method was 70. Attributes were either nominal (the range was a finite set of features,
typically presence or absence of some fact) or ordinal (the range was a set of reals,
typically giving the amount of money in some transaction). Attributes were categorized
into several categories (a category is actually a Boolean attribute), on average
into 5 for each one, to obtain a basic matrix of zeros and ones suitable for GUHA
processing.



5.2 Discovered Knowledge 

We took the following natural point of view on good and bad clients. If a loan has been
granted to a client, then the client is good if there have been no problems with loan payments,
and the client is bad if there have been problems with payment. In our
discovery work we concentrated on exploring appropriate characterizations of
this notion of a good or bad client. Consequently, we worked only with
data of clients with a granted loan.

We assume that it is of eminent interest to every bank to reveal in advance whether a
client asking for a loan is good or bad according to our definition. This prediction
has to be based on information the bank knows at the time of asking for a loan.
That is why we carried out our exploration on data (namely data from table
transaction) that were older than the date when the loan was granted.

In the first phase we aimed at hypotheses with single antecedents and at the
characterization of good clients. These clients have accounts satisfying the category
LOAN_STATUS:GOOD. By processing the preprocessed data with the GUHA+-
software we found the hypotheses summarized in Table 1. We used the Fisher
quantifier with the significance level alpha=0.05 and restricted ourselves
to hypotheses supported by at least 15 objects ($a \ge 15$). The statistic Prob presented
in Table 1 is given by the fraction $a/(a + b)$ and characterizes hypotheses in the sense
of an implication.

Table 1. Characterization of good clients - succedent LOAN_STATUS:GOOD

  #   antecedent                    ff-table     Fisher      Prob
  1   AvgM_SANCTION_INTEREST:NO     603    50    6.12e-024   0.92
                                      3    26
  2   ORDER_HOUSEHOLD:YES           421    20    5.03e-013   0.95
                                    185    56
  3   USER:YES                      145     0    3.76e-009   1.00
                                    461    76
  4   CARD_TYPE:CARD_YES            165     5    1.39e-005   0.97
                                    441    71

Hypotheses are sorted increasingly according to the value of the Fisher statistic.
Hypothesis #1 says that there is a strong association between the absence of sanction
interest payments and good loan payments. Hypothesis #2 says that there are
no problems with loan payments if a household permanent order is issued on the
account. Hypotheses #3 and #4 tie good loan payments to the presence
of another client who can manipulate the account (typically, the client’s wife or
husband) or to the presence of a card issued to the account, respectively.
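
The two statistics reported for each hypothesis can be recomputed from its four-fold table. The sketch below is an illustration under an assumption: the Fisher value is taken to be the one-sided tail probability of the hypergeometric distribution (tables with the same marginal sums and at least a objects in the first cell), which may differ in detail from the statistic used by the software. The function names are hypothetical.

```python
from math import comb


def prob(a, b):
    """Relative frequency a/(a+b) of the succedent among objects with the antecedent."""
    return a / (a + b)


def fisher_one_sided(a, b, c, d):
    """P(X >= a) for a hypergeometric X with the marginal sums of the table (a, b, c, d)."""
    r = a + b              # objects satisfying the antecedent
    k = a + c              # objects satisfying the succedent
    n = a + b + c + d      # all objects
    tail = sum(comb(r, x) * comb(n - r, k - x) for x in range(a, min(r, k) + 1))
    return tail / comb(n, k)


# Four-fold tables of the hypotheses in Table 1.
tables = {
    "AvgM_SANCTION_INTEREST:NO": (603, 50, 3, 26),
    "ORDER_HOUSEHOLD:YES": (421, 20, 185, 56),
    "USER:YES": (145, 0, 461, 76),
    "CARD_TYPE:CARD_YES": (165, 5, 441, 71),
}
for name, (a, b, c, d) in tables.items():
    print(name, round(prob(a, b), 2), fisher_one_sided(a, b, c, d))
```

A hypothesis passes the test at the level alpha=0.05 when the computed tail probability does not exceed 0.05 and the support condition on a holds.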

Table 1 can be used for the characterization of bad clients, satisfying the
LOAN_STATUS:BAD category, as well. Due to the symmetry of the Fisher quantifier we
can read each hypothesis as “if a client does not satisfy the antecedent of the hypothesis
then there is a greater probability than on average that the loan will not be paid.”
For example, for hypothesis #3, if only the owner can manipulate the account
then it is more risky to grant him a loan than if there were another
person who can manipulate the account as well.

In the second phase of our exploration we aimed at hypotheses with compound
antecedents of length 2. We used the Fisher quantifier with significance level
alpha=0.001, restricted to hypotheses supported by at least 15 objects ($a \ge
15$). We applied a further restriction to choose only 100% implication hypotheses
(Prob=1.00).

These hypotheses are 100% implications, as can be seen from the four-fold
tables. The antecedents of the hypotheses in Table 2 contain combinations of other
properties of an account that (in the data) give a 100% warranty that the loan will be paid
back. The category ORDER_OTHER:NO says that there is no other kind of
permanent order issued to the account besides those for insurance, household, leasing
or loan payments. Category ORDER_HOUSEHOLD:> 5000 clearly denotes
clients with a household permanent order payment greater than 5000. Category
ACC_DIST_A7:> 3 says that the bank address is in a district with more than 3
municipalities with number of inhabitants from 2000 to 10000.

Table 2. Characterization of good clients - succedent LOAN_STATUS:GOOD

  #   antecedent                 ff-table     Fisher      Prob
  1   ORDER_OTHER:NO             208     0    1.31e-013   1.00
      ORDER_HOUSEHOLD:YES        398    76
  2   ACC_DIST_A7:> 3            139     0    9.33e-009   1.00
      ORDER_HOUSEHOLD:> 5000     467    76

The form of antecedents for the LOAN_STATUS:BAD succedent reveals that the critical
property for bad loan payments is the presence of sanction interest, but even in this
case the loan can be paid back. However, if there is sanction interest issued on the
account together with a property from the set given by the first members of the antecedents
of hypotheses in Table 3, there were always problems with payments. The
categories in the table have the following meaning:

Table 3. Characterization of bad clients - succedent LOAN_STATUS:BAD

  #   antecedent                   ff-table     Fisher      Prob
  1   OWNER_DIST_A13:> 3            21     0    6.27e-022   1.00
      AvgM_SANCT._INTEREST:YES      55   606
  2   OWNER_DIST_A7:≤ 6             20     0    7.41e-021   1.00
      AvgM_SANCT._INTEREST:YES      56   606
  3   OWNER_DIST_A6:≤ 27            17     0    1.11e-017   1.00
      AvgM_SANCT._INTEREST:YES      59   606

OWNER_DIST_A13:> 3 - the unemployment rate in the year 1996 in the
district where the client lives was greater than 3%;

OWNER_DIST_A7:≤ 6 - the client’s address is in a district with at most
6 municipalities with number of inhabitants from 2000 to 10000;

OWNER_DIST_A6:≤ 27 - the address of the client is in a district with
at most 27 municipalities with number of inhabitants from 500 to 2000.

In this section we presented a short example of the use of the GUHA method
in the data mining process. For a full review of how GUHA coped with the financial
data set challenge see [2].

6 Relation to (Relational) Data Mining and Discovery 
Science 

Both data mining and discovery science are terms that emerged recently and
have aims similar to each other as well as to the main ideas of GUHA declared
from its beginning: to develop methods of discovering (mining) knowledge from
data (usually large data). Relations of GUHA and its ancestors to data mining
and discovery science were analyzed in [35], [17], [26], [12].

In particular, our hypotheses described above are more general than Agrawal’s
association rules, particularly by (1) explicit use of negations and (2) choice from
a variety of quantifiers, not just FIMPL. On the other hand, Agrawal’s data
mining is particularly developed for processing extremely large data, which
influences the choice of techniques for pruning the system (tree) of hypotheses (i.e.
rules). These aspects and possibilities of mutual influence of GUHA and Agrawal’s
approach are analyzed especially in [12].

The fact that Agrawal’s notion of an association rule [1] occurs already in [14]
remained unnoticed for a long time; but note that it is explicitly stated e.g. in [29].
Let us stress that priority questions are by far not the most important thing;
what is valuable is possible mutual influence. (This is discussed e.g. in [12].)

A relatively new direction of data mining is relational data mining [5]. It
means that the input of the mining procedure is not one single database table
but several tables. GUHA procedures dealing with several database tables
were also defined and studied, see [33], [34], [40]. Such a GUHA procedure mines
for multi-relational association rules. GUHA hypotheses as described above are
formulae of observational calculi that can be obtained by a modification of
classical predicate calculi. Multi-relational hypotheses are formulae of many-sorted
observational calculi with possibly binary, ternary, ... predicates that can be
obtained by a modification of classical many-sorted predicate calculi. Some
theoretical results concerning e.g. decidability of many-sorted observational calculi were
achieved [33]. It can be useful to compare the GUHA approach to further techniques
of relational data mining.



7 The Relation of the GUHA Method to Modern 
Database Systems 

The GUHA implementations were used in various research domains (e.g. in
medicine [22,28,43], pharmacology [23,24], banking [2,30] or in meteorology [3]),
but admittedly, they never came into broad use.

During the rather long history of the method, several implementations
were created, reflecting the rapid development in information
technologies (IT). The first implementations were realized on the MINSK22
computer in the 60’s and on mainframes in the 70’s. In the 80’s implementations were
transferred to the IBM PC platform, which gave the PC-GUHA implementation for the MS
DOS operating system. There are two present implementations named GUHA+-
[46] and 4FT-Miner [39]. The latter one was described above in Section 4.

In the first implementations of the method, data were stored in stand-alone
plain files. In these files the data were either already dichotomized (the oldest
implementations) or kept in raw form and subsequently dichotomized automatically
on the basis of scripts specified by the user (PC-GUHA). However, such a way of
accessing raw data became untenable in the light of developments in the database
software industry. Therefore the present implementations (GUHA+-, 4FT-Miner)
employ the universal ODBC interface to access raw data, typically stored in the
form of tables in MS Access or MS Excel. But progress continues.

Note that a GUHA-DBS database system was proposed in the early 1980s by
Pokorny and Rauch [32], see also [34]. This is now rather obsolete, and this
line of development was long neglected in GUHA. But the development of the
4ft-Miner brought important progress in this direction [44]. See also [6].

Contemporary trends in the data mining area are driven by requirements for
processing huge data sets, which brings new research and implementation
problems for GUHA and its logical and statistical theory. Standard sources of huge
databases, typically created in a dynamic way, are hypermarkets, banks,
internet applications, etc. These data sets are enormous, and professional databases
such as Oracle or MS SQL Server have to be used to manage them. To cope with
these modern trends, a research group was formed around COST Action 274,
aiming at a new implementation of the GUHA method that works efficiently
with large data sets.

In general, there are two possible ways of accessing raw data in a new
implementation of the GUHA method.

The first way is to transform the respective data objects of modern database
systems (e.g., data cubes of MS SQL Server 2000) into standard tables and then
access the data by old algorithms and their generalisations in the spirit of
relational data mining. The second approach is to homogeneously interconnect GUHA
core algorithms with data access algorithms offered by modern database systems,
resulting in a qualitatively new (faster) processing of huge data sets. A detailed
description of work on this task can be found in the report [7].

The other direction of research we are interested in is the incorporation
of the GUHA method into modern databases as a support tool for an
intellectual workflow related to information classification and structuring,
analytical processing and decision support modeling [6]; see also experiments with
incorporating the GUHA procedure 4ft-Miner into the medical data mining process at
http://euromise.vse.cz/stulong-en/.

Rapid progress in the area of database management systems (DBMS) has
sufficiently simplified information classification and structuring processes. Powerful
database engines developed in the last few years have also improved the analytical
processing capability of a DBMS. The most significant gap in the automation of an
intellectual workflow lies between analytical processing and decision support modeling
(these two processes influence each other within a decision-driven loop [7]).

To understand more precisely what the gap is, we developed a new model, “IT
preferences for a decision making” [8]. This model was applied to evaluate
three existing discovery engines (MDS Engine, GUHA Engine and DEX-HINT
engine). The model enables us to specify a new combination of features required
for a successful GUHA implementation (called GUHA Virtual Machine) in the near
future. The target implementation framework of the GUHA Virtual Machine relates
to complex analytical and decision problems like radioactive waste management [21].
8 Conclusion

Research tasks include: Further comparison of GUHA theory with the approaches 
of data mining for mutual benefits. Systematic development of the theory in re- 
lation to fuzzy logic (in the style of Hajek’s monograph [10]). Development of 
observational calculi for temporal hypotheses (reflecting time). Systematic de- 
velopment of the database aspects, in particular improving existing methods and 
elaborating new methods of data pre-processing and post-processing of GUHA 
results. Design of a new GUHA-style system based on data received from dis- 
tributed network resources, and construction of a model of the customer decision 
processes. 
Abstract. A logical language, SeqLog, for mining and querying sequential
data and databases is presented. In SeqLog, data takes the form of a
sequence of logical atoms, background knowledge can be specified using
Datalog-style clauses, and sequential queries or patterns correspond to
subsequences of logical atoms. SeqLog is then used as the representation
language for the inductive database mining system MineSeqLog. Inductive
queries in MineSeqLog take the form of a conjunction of a monotonic
and an anti-monotonic constraint on sequential patterns. Given such an
inductive query, MineSeqLog computes the borders of the solution space.
MineSeqLog uses variants of the famous level-wise algorithm together
with ideas from version spaces to realize this. Finally, we report on a
number of experiments in the domain of user modelling that validate
the approach.



1 Introduction 

Data mining has received a lot of attention recently, and the mining of
knowledge from data of various models has been studied. One popular data model
that has been studied concerns sequential data [1, 2, 3, 4, 5, 6]. Many of these
approaches are extensions of the classical level-wise itemset discovery algorithm
“Apriori” [7]. However, the data models that have been used so far for modelling
sequential patterns are not very expressive and are often based on some form of
propositional logic. The need for a more expressive kind of pattern arises, e.g.,
when modelling Unix users [8]. As an example, the command sequence

1. ls

2. vi paper.tex

3. latex paper.tex

4. dvips paper.dvi

5. lpr paper.ps

can be represented as a sequence of first-order terms: “ls vi(paper.tex)
latex(paper.tex) dvips(paper.dvi) lpr(paper.ps)”. With such a representation
model, it is possible to discover first-order rules such as “vi(X) latex(X)
is frequent”.^

(c) Springer-Verlag Berlin Heidelberg 2004

Constraint Based Mining of First Order Sequences in SeqLog

Researchers such as Mannila and Toivonen [9] have realized the need for
such more expressive frameworks. They have introduced a simple data model
that allows one to use binary predicates as well as variables. In doing so, they
have taken a significant step in the direction of multi-relational data mining
and inductive logic programming. However, when comparing their data model
to those traditionally employed in inductive logic programming, such as Datalog,
the model is more limited and less expressive. Indeed, traditional inductive
logic programming languages would allow the user to encode background
knowledge (in the form of view predicates or clauses). They would possess a fix-point
semantics as well as an entailment relation.

In this chapter, we first introduce a simple logical data model for mining 
sequences, called SeqLog. It is in a sense the sequence equivalent of the Data- 
log language for deductive databases. Moreover, we provide a formal semantics, 
study the entailment and subsumption relations and provide clause like mech- 
anisms to define view predicates. Through the introduction of the SeqLog lan- 
guage, we put sequential data mining on the same methodological grounds as 
inductive logic programming. This framework may be useful also for other min- 
ing or learning tasks in inductive logic programming. In this context, our lab 
has also developed an approach for analysing SeqLog type sequences based on 
Hidden Markov Models, cf. [10]. 

Next, we will see how the MineSeqLog system uses SeqLog to mine for SeqLog
patterns of interest in sequential data. MineSeqLog combines principles of the
level-wise search algorithm with version spaces in order to find all patterns that
satisfy a constraint of the form $a \wedge m$, where $a$ is an anti-monotonic constraint
(such as “the frequency of the target patterns on the positives is at least 10%”)
and $m$ is a monotonic constraint (such as “the frequency on the negatives is at most
1%”). While most attention in the data mining community goes to handling an
anti-monotonic constraint (e.g. minimum frequency), we argue that it is also 
useful to consider monotonic constraints (e.g. maximum frequency), especially 
in conjunction with an anti-monotonic constraint. Our group came across such 
a need when we developed the molecular feature miner (MolFea) [11]. With a 
conjunction of both kinds of constraints, we are able to instruct MineSeqLog and 
MolFea to find rules and patterns that are frequent in a certain data subset, but 
infrequent in another subset. Such rules are useful for distinguishing patterns 
exhibited by the two subsets. Using the Unix user modelling example, this dual- 
constraint approach allows us to find out command sequences that are frequently 
used by novice users, but seldom used by advanced users. The key design issue in 
MineSeqLog was the development of an optimal refinement operator for SeqLog. 
We will go through this operator in detail. Finally, we validate the approach 
using some experiments in the domain of user modelling. 



^ We follow the Prolog notation here in which identifiers for variables start with capital
letters.
This chapter is organized as follows: in Sect. 2, we introduce SeqLog and
define its semantics; in Sect. 3, we define the mining task addressed in SeqLog 
and provide the main algorithms; in Sect. 4, we define an optimal refinement 
operator for SeqLog (without functors); in Sect. 6, we present some preliminary 
experiments, and finally, in Sect. 7, we conclude and touch upon related work. 



2 SeqLog: SEQuential LOGic 

In this section, we introduce a representational framework for sequences called 
SeqLog. The framework is grounded in first order logic and akin to Datalog 
except that SeqLog represents sequences rather than relations or predicates. 
This also motivates the use of the traditional logical terminology, which we 
briefly review here. 

An atom $p(t_1, \ldots, t_n)$ consists of a relation symbol $p$ of arity $n$ followed by $n$
terms $t_i$. A term is either a constant or a variable^. A substitution $\theta$ is a set of
the form $\{v_1 \leftarrow t_1, \ldots, v_n \leftarrow t_n\}$ where the $v_i$ are variables and the $t_i$ terms. One
can apply a substitution $\theta$ on an expression $e$ yielding the expression $e\theta$ which
is obtained by simultaneously replacing all variables $v_i$ by their corresponding
terms $t_i$.

We can now introduce sequences and sequential databases. 

Simple Sequence. A simple sequence is a possibly empty ordered list of atoms.
The empty list is denoted by $\omega$. A few examples of simple sequences are:

1. latex(kdd,tex) xdvi(kdd,dvi) dvips(kdd,dvi) lpr(hpml,kdd)

2. latex(FileName,tex) xdvi(FileName,tex)

3. helix('A', h(right,alpha), 7) strand('SA', 'A', 1, 0, 6)

Simple sequences are used to represent data in a sequential database. 

Complex Sequence. A complex sequence is a possibly empty sequence of
atoms separated by operators: $l_0\ op_1\ l_1\ op_2\ l_2\ op_3 \ldots op_n\ l_n$. The two
operators that are employed are $\lhd$ (which we often omit for readability reasons)
and $<$. The former, i.e. $\lhd$, denotes the ’direct successor’ operator; the latter,
i.e. $<$, encodes the transitive closure of $\lhd$. An example of a complex sequence
is latex(FileName,tex) < dvips(FileName,dvi). It states that the atom
dvips(FileName,dvi) occurs somewhere after latex(FileName,tex).
Complex sequences are used to represent sequential patterns. In this chapter, we
use the term “query” interchangeably with “pattern”.

Heads and Tails of Queries. The $head(q)$ of a complex sequence $q$ denotes
the maximal prefix of $q$ that does not contain the operator $<$. The remainder
of the query will be referred to using $tail(q)$. E.g., if $q = a\ b\ c < d\ e$ then
$head(q) = a\ b\ c$ and $tail(q) = d\ e$.
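
The head/tail split can be made concrete with a small Python sketch. It is only an illustration under assumptions that are not in the chapter: a query is stored as a pair (atoms, ops), the names DIRECT, AFTER, parse, head and tail are hypothetical, and atoms are written without internal spaces so that whitespace separates them.

```python
DIRECT = "<|"   # assumed encoding of the direct-successor operator (omitted in the text)
AFTER = "<"     # the "somewhere after" operator

def parse(text):
    """Parse a query such as 'a b c < d e' into (atoms, ops).

    ops[k] is the operator between atoms[k] and atoms[k + 1].
    """
    atoms, ops = [], []
    pending = DIRECT
    for token in text.split():
        if token == AFTER:
            pending = AFTER
            continue
        if atoms:
            ops.append(pending)
        atoms.append(token)
        pending = DIRECT
    return atoms, ops

def head(query):
    """Maximal prefix of the query that contains no '<' operator."""
    atoms, ops = query
    if AFTER not in ops:
        return atoms, ops
    k = ops.index(AFTER)
    return atoms[:k + 1], ops[:k]

def tail(query):
    """Everything after the first '<' operator (empty if there is none)."""
    atoms, ops = query
    if AFTER not in ops:
        return [], []
    k = ops.index(AFTER)
    return atoms[k + 1:], ops[k + 1:]

q = parse("a b c < d e")
print(head(q)[0], tail(q)[0])   # ['a', 'b', 'c'] ['d', 'e']
```

Keeping atoms and operators in two parallel lists makes the split a matter of locating the first "<" in the operator list.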

Given two sequences q and s we can define a notion of subsumption. 



^ In principle, we could also allow for functors. However, our optimal refinement op- 
erator (Sect. 4) is optimized for the case without functors. 
Simple Subsumption. A simple sequence $l_0\ l_1 \ldots l_n$ s-subsumes a sequence
$s_0\ op_1 \ldots op_m\ s_m$ if and only if there exists a substitution $\theta$ such that
$(l_0\ l_1\ l_2 \ldots l_n)\theta$ is a subsequence of $s_0 \ldots s_m$, i.e.: $l_0\theta = s_i \wedge \ldots \wedge l_n\theta = s_{i+n}$
and $op_{i+1} = \ldots = op_{i+n} = \lhd$.

Subsumption. A complex sequence $q$ subsumes a sequence $s_0\ op_1 \ldots op_n\ s_n$ if
and only if there exists a substitution $\theta$ and an integer $i$ such that $head(q)\theta$
s-subsumes $s_0\ op_1 \ldots op_i\ s_i$ and $tail(q)\theta$ subsumes $s_{i+1}\ op_{i+2} \ldots op_n\ s_n$.

Under this definition, a simple sequence $s$ can only subsume a sequence $c$ if $c$
has a simple subsequence $c'$ such that $s \sqsubseteq c'$.

For example, the sequence $s_1 = b\ c(X)$ s-subsumes the sequence $s_2 =
a\ b\ c(p)\ d(q)$ with the substitution $\{X \mapsto p\}$. The sequence $s_3 = a\ b < c(X)$
subsumes $s_2$ and also $s_4 = a\ b < c(p)\ d(q)$ with the same substitution. However,
$s_3$ does not subsume $s_5 = a < b\ c(p)\ d(q)$, as $head(s_3) = a\ b$ does not s-subsume
any fragment of $s_5$.

When a sequence $s_1$ subsumes another sequence $s_2$ we will write $s_1 \sqsubseteq s_2$. The
introduced notions of subsumption will be useful to query sequential databases
(cf. below) and also to reason about the generality of patterns.
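
The two subsumption tests can be sketched in Python as follows. This is one possible reading of the definitions rather than the MineSeqLog implementation; the tuple encoding of atoms, the operator strings and all function names are assumptions made for the illustration. Variables start with a capital letter, as in the Prolog convention used in this chapter.

```python
DIRECT, AFTER = "<|", "<"   # assumed encodings of the two operators

def is_var(term):
    return term[:1].isupper()

def match_atom(pat, atom, theta):
    """Extend theta so that pat instantiated by theta equals atom, else None."""
    if pat[0] != atom[0] or len(pat) != len(atom):
        return None
    theta = dict(theta)
    for p, t in zip(pat[1:], atom[1:]):
        if is_var(p):
            if theta.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return theta

def windows(pat_atoms, seq, start, theta):
    """Yield (end, theta) for every contiguous, DIRECT-linked match in seq."""
    atoms, ops = seq
    n = len(pat_atoms)
    for k in range(start, len(atoms) - n + 1):
        if any(op != DIRECT for op in ops[k:k + n - 1]):
            continue
        th = theta
        for p, a in zip(pat_atoms, atoms[k:k + n]):
            th = match_atom(p, a, th)
            if th is None:
                break
        else:
            yield k + n, th

def s_subsumes(pat_atoms, seq):
    """Substitution witnessing s-subsumption, or None."""
    return next((th for _, th in windows(pat_atoms, seq, 0, {})), None)

def subsumes(query, seq, start=0, theta=None):
    """Match head(query) in a window, then tail(query) in the rest (backtracking)."""
    theta = {} if theta is None else theta
    q_atoms, q_ops = query
    if not q_atoms:
        return theta
    cut = q_ops.index(AFTER) if AFTER in q_ops else len(q_ops)
    rest = (q_atoms[cut + 1:], q_ops[cut + 1:])
    for end, th in windows(q_atoms[:cut + 1], seq, start, theta):
        result = subsumes(rest, seq, end, th)
        if result is not None:
            return result
    return None

a, b, cX, cp, dq = ("a",), ("b",), ("c", "X"), ("c", "p"), ("d", "q")
s2 = ([a, b, cp, dq], [DIRECT, DIRECT, DIRECT])
s3 = ([a, b, cX], [DIRECT, AFTER])
s4 = ([a, b, cp, dq], [DIRECT, AFTER, DIRECT])
s5 = ([a, b, cp, dq], [AFTER, DIRECT, DIRECT])
print(s_subsumes([b, cX], s2), subsumes(s3, s5))   # {'X': 'p'} None
```

The sketch returns the substitution when the test succeeds and None otherwise, so an empty substitution (a successful match without variables) must be compared against None rather than tested for truth.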

Sequences in SeqLog correspond to base predicates in Datalog. Akin to Data- 
log, we allow the user also to specify view predicates in terms of queries. 

A Sequential Clause is an expression of the form $h \leftarrow q$ where $h$ is a literal
and $q$ is a (possibly complex) sequence. Predicates appearing in the
conclusion part of clauses will be called view predicates.

Notice that sequential clauses can be recursive. 

A Sequential Database D consists of a set of sequential clauses, clauses{D), 
and a set of sequences, sequences{D). 

By now, we have everything available to define the semantics of SeqLog. This 
is realized by analogy to the well-known Tp operator in computational logic. 

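Definition 1. Let $Q$ be a sequential database. Then

$$T_Q(S) = S \cup \{\, s_0\ op_1 \ldots op_i\ s_i\ op_{i+1}\ h\theta\ op_j\ s_j \ldots s_m \mid s_0\ op_1 \ldots op_m\ s_m \in S$$
$$\wedge\ \exists (h \leftarrow l_1 \ldots l_k) \in clauses(Q),\ \theta : l_1\theta = s_{i+1} \wedge l_k\theta = s_{j-1}\ \wedge$$
$$(l_1 \ldots l_k)\theta \sqsubseteq s_{i+1}\ op_{i+2} \ldots op_{j-1}\ s_{j-1} \,\}$$

We can then inductively define $T_Q^n(S) = T_Q(T_Q^{n-1}(S))$. The fix-point is then
$T_Q^{\infty}(S)$. It defines the meaning of the database $S$.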

Definition 2. A sequence $s$ is logically entailed by a sequential database $Q$,
notation $Q \models s\theta$, if and only if $\exists s' \in T_Q^{\infty}(sequences(Q))$ such that $s$ subsumes $s'$
with substitution $\theta$.
Sequences of the form s can be regarded as queries in the above definition.

Consider for example the following sequential database: { “b a a”, “p ← a”,
“p ← a p” }. Then $T_Q^{\infty}(\{b\ a\ a\}) = \{b\ a\ a,\ b\ p\ a,\ b\ a\ p,\ b\ p\ p,\ b\ p\}$. We would
therefore have that $Q \models b < p$ and $Q \models b\ a < p$.
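
The fix-point of this example can be reproduced with a short Python sketch. It covers only the propositional case with simple clause bodies (no variables, no "<" operator), which is all the example needs; the function names t_step and fixpoint are hypothetical.

```python
def t_step(seqs, clauses):
    """One application of the operator: rewrite a contiguous body match to its head."""
    out = set(seqs)
    for s in seqs:
        for clause_head, body in clauses:
            n = len(body)
            for k in range(len(s) - n + 1):
                if s[k:k + n] == body:
                    out.add(s[:k] + (clause_head,) + s[k + n:])
    return out

def fixpoint(seqs, clauses):
    """Iterate t_step until nothing new is derived.

    Terminates because a rewrite never lengthens a sequence.
    """
    current = set(seqs)
    while True:
        nxt = t_step(current, clauses)
        if nxt == current:
            return current
        current = nxt

clauses = [("p", ("a",)), ("p", ("a", "p"))]      # p <- a   and   p <- a p
closure = fixpoint({("b", "a", "a")}, clauses)
print(sorted(" ".join(s) for s in closure))
# ['b a a', 'b a p', 'b p', 'b p a', 'b p p']
```

The derived sequence b p arises in two steps: b a a is first rewritten to b a p, and the body a p of the recursive clause then matches.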

At this point, we wish to stress that resolution type mechanisms can be 
employed in order to obtain algorithms for reasoning about entailment. Also, 
s-subsumption can be decided in time polynomial in the length of the sequence, 
whereas testing for subsumption is NP-complete. Though we will not elaborate 
much further on the use of SeqLog outside the context of constraint based mining, 
the reader familiar with inductive logic programming should notice that the 
entailment relation could be employed as a coverage relation. This, in turn, 
allows us to adapt the traditional inductive logic programming settings for use 
with SeqLog instead of Prolog. In this way it is straightforward to define a 
concept-learning task for sequential data in SeqLog. 

3 Constraint Based Mining in SeqLog 

3.1 Conjunctive Constraints 

Now that we have defined our SeqLog formalism, it becomes possible to mine
sequential data. The data mining task addressed uses a background theory $KB$
(a set of clauses in SeqLog) as well as a set of sequences $D$ (possibly divided into
subsets $D_1, \ldots, D_n$). The aim is then to find all sequences satisfying the specified
constraints with regard to $KB$ and $D$. A variety of constraints can be used.
The only requirement is that the overall constraint $a \wedge m$ can be written as the
conjunction of a monotonic component $m$ and an anti-monotonic component $a$.

Definition 3. A constraint $p$ is anti-monotonic if and only if $\forall$ sequences $x, y$ :
$(x \sqsubseteq y) \wedge p(y) \rightarrow p(x)$.

Definition 4. A constraint $p$ is monotonic if and only if $\forall$ sequences $x, y$ :
$(x \sqsubseteq y) \wedge p(x) \rightarrow p(y)$.

For example, the “minimum frequency” constraint commonly used in data
mining is an anti-monotonic one. In particular, since any sequence that satisfies
a pattern $y$ must also satisfy all patterns $x \sqsubseteq y$, the frequency for $y$ must be no
higher than that for $x$. Thus, if the minimum frequency constraint is satisfied for
pattern $y$, it must also be satisfied by all $x \sqsubseteq y$. On the other hand, a “maximum
frequency” constraint is monotonic.

Notice that this is a quite expressive framework given that one can compose 
complex anti-monotonic (resp. monotonic) constraints on the basis of simpler 
ones. This is because the conjunction and disjunction of a set of anti-monotonic 
(resp. monotonic) constraints is still anti-monotonic (resp. monotonic), whereas 
the negation of an anti-monotonic (resp. monotonic) constraint is monotonic 
(resp. anti-monotonic). Within MineSeqLog, the following primitives are directly 
supported. They are inspired by similar primitives for simple sequences in the 
molecular feature miner MolFea [11]. 
- $T \sqsubseteq p$, $p \sqsubseteq T$, $\neg(T \sqsubseteq p)$ and $\neg(p \sqsubseteq T)$: where $T$ is the unknown target
query and $p$ is a logical sequence; this type of primitive constraint denotes
that $T$ should (resp. should not) subsume the sequence $p$; e.g., the constraint
$a\ b\ c \sqsubseteq T$ specifies that the target pattern $T$ should be subsumed by $a\ b\ c$.

- $freq(T, E)$ denotes the frequency of a pattern $T$ on a set of sequences $E$;
the frequency of a pattern $T$ on a data-set $E$ is defined as the number of
sequences in $E$ that $T$ matches (possibly taking into account clauses in the
background knowledge $B$). More formally, we have

$$freq(T, E) = |\{e \in E \mid B \cup \{e\} \models T\}|$$

E.g., the frequency of $a < b$ on the data set $E = \{a\ b\ c,\ a\ c\ b,\ a\ c\}$ is 2.

- $freq(T, E_1) \leq c_1$, $freq(T, E_2) \geq c_2$, where the $c_i$ are positive integers and $E_1$
and $E_2$ are sets of sequences; such a constraint bounds the frequency of
$T$ on the data-set $E_i$ from above ($\leq$) or from below ($\geq$) by
$c_i$; e.g., the constraint $freq(T, Pos) \geq 100$ denotes that the target patterns
$T$ should have a minimum frequency of 100 on the set of positive sequences
$Pos$.

These primitive constraints can now conjunctively be combined in order to
declaratively specify the target queries of interest. Note that the conjunction
may specify constraints w.r.t. any number of data-sets, e.g., imposing a minimum
frequency on a set of positive sequences, and a maximum one on a set of negative
ones. E.g., we can express the question “What Unix command sequences are
typically used by experts only?” with $a$ = “$freq(s, expert) \geq threshold_1$” and $m$ =
“$freq(s, novice) \leq threshold_2$”, with databases “expert” and “novice” denoting
sequences of Unix commands used by expert users and novice users, respectively.
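
As a rough illustration of the frequency primitive and of a conjunctive query, the following Python sketch handles propositional patterns only and ignores background knowledge. The toy expert and novice data, the thresholds and the function names are invented for the example and do not come from the experiments reported in this chapter.

```python
def parse(text):
    """'a b < c' -> [('a', 'b'), ('c',)]: blocks of adjacent atoms separated by '<'."""
    return [tuple(block.split()) for block in text.split("<")]

def matches(pattern, seq):
    """True if the blocks occur in seq in order, each block contiguously.

    Taking the leftmost occurrence of each block is sufficient in the
    propositional case, since it leaves the most room for later blocks.
    """
    start = 0
    for block in pattern:
        n = len(block)
        for k in range(start, len(seq) - n + 1):
            if seq[k:k + n] == block:
                start = k + n
                break
        else:
            return False
    return True

def freq(pattern, data):
    """Number of sequences in data matched by the pattern."""
    return sum(matches(pattern, e) for e in data)

E = [("a", "b", "c"), ("a", "c", "b"), ("a", "c")]
print(freq(parse("a < b"), E))   # 2, as in the example above

# Hypothetical toy data for the expert/novice question.
expert = [("ls", "vi", "latex"), ("vi", "latex", "dvips"), ("ls", "vi")]
novice = [("ls", "ls"), ("ls", "vi"), ("vi", "latex")]

def satisfies(pattern, min_expert=2, max_novice=1):
    """Conjunction of an anti-monotonic and a monotonic frequency constraint."""
    return freq(pattern, expert) >= min_expert and freq(pattern, novice) <= max_novice

print(satisfies(parse("vi latex")), satisfies(parse("ls")))   # True False
```

The pattern ls fails the conjunction because it is frequent among the novice sequences as well, which is exactly the case the maximum-frequency constraint is meant to exclude.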

3.2 Characterizing the Solution Space 

It is well-known that the space of solutions $Sol(q)$ to certain types of constraints
can be represented using boundary sets. E.g., Hirsh [12] has shown that sets that 
are convex and definite can be represented using their boundaries; Mannila and 
Toivonen [13] have shown that the solution space for anti-monotonic queries can 
be represented using the set of minimally general elements. Furthermore, various 
algorithms exploit these properties for efficiently finding solutions to queries, cf. 
Bayardo’s MaxMiner [14] and Mitchell’s candidate-elimination algorithm [15] as 
well as our MolFea system [11]. 

The boundary sets, sometimes also called the borders, are the most specific
(resp. the most general) patterns within (or just outside) the set. More formally,
let $P$ be a set of patterns. Then we denote the set of minimal (i.e. minimally
specific) patterns within $P$ as $\min(P)$. Dually, $\max(P)$ denotes the maximally
specific patterns within $P$. In Mannila and Toivonen’s terminology this
corresponds to the positive border $Bd^{+}(P)$.

In the remainder of this paper, we will be largely following Mitchell and
Hirsh’s notation and terminology. We will also assume that the cardinality of
$\mathcal{L}$ is finite (finiteness implies that all subsets of $\mathcal{L}$ are definite, one of the two
requirements for having boundary set representability, cf. [12])^.

Definition 5. The S-set w.r.t. a query $q \in Q$ is defined as $S(q) = \max(Sol(q)) =
\{s \in Sol(q) \mid \neg\exists s' \in Sol(q) : s \sqsubseteq s' \wedge s' \not\sqsubseteq s\}$. The G-set w.r.t. a query $q \in Q$ is
defined as $G(q) = \min(Sol(q)) = \{s \in Sol(q) \mid \neg\exists s' \in Sol(q) : s' \sqsubseteq s \wedge s \not\sqsubseteq s'\}$.

Definition 6. A set $T$ is upper boundary set representable if and only if $T =
\{t \in \mathcal{L} \mid \exists s \in \max(T) : t \sqsubseteq s\}$; it is lower boundary set representable if and only
if $T = \{t \in \mathcal{L} \mid \exists g \in \min(T) : g \sqsubseteq t\}$; and a set $T$ is boundary set representable
if and only if $T = \{t \in \mathcal{L} \mid \exists g \in \min(T), s \in \max(T) : g \sqsubseteq t \sqsubseteq s\}$.

Sets that are boundary set representable are sometimes also referred to as
version spaces. It is well-known that finite solution spaces of constraints of the
form $a \wedge m$, with $a$ an anti-monotonic and $m$ a monotonic constraint, are boundary set
representable, cf. [12,16,11].

The key problem that remains now is how to compute the version space
for a MineSeqLog query of the form $a \wedge m$. One approach would be to employ the
level-wise version space algorithm employed in MolFea [16,11]. This algorithm
combines the well-known level-wise algorithm with the description identification
algorithm of [17]. In this paper, we present another approach, the MineSeqLog
algorithm (Algorithm 1). This algorithm reuses a level-wise algorithm for
discovering patterns under anti-monotonic constraints only, such as Apriori [7]
or MaxMiner [14]. Reusing these algorithms has the advantage that they are
already well studied, with many optimization and implementation techniques
available. Further improvements to these algorithms automatically apply to
our approach, too. We present our version of MaxMiner, FindMaximalPatterns,
in Sect. 5 (Algorithm 2).

FindMaximalPatterns is called twice, as sketched in Algorithm 1. In the first
invocation, FindMaximalPatterns finds the set of maximal patterns for
constraint $a$. The second invocation finds the set of maximal patterns satisfying
$\neg m$, which is also a set of minimal patterns just not satisfying $m$.

It is easy to see that

$$Sol(a \wedge m) = \{p \mid \exists u \in U : p \sqsubseteq u \text{ and } \neg\exists l \in L : p \sqsubseteq l\}$$

This directly follows from the observation that the negation of a monotonic
constraint $m$ is anti-monotonic. Furthermore, we want to find those patterns
$p$ that satisfy the anti-monotonic constraint $a$ (hence $\exists u \in U : p \sqsubseteq u$) and
do not satisfy $\neg m$ (hence $\neg\exists l \in L : p \sqsubseteq l$). Not all the elements of $U$ and $L$ found
in lines 8 and 9 are useful, however. For any $u \in U$ such that we can find an
$l \in L$ satisfying $u \sqsubseteq l$, we observe that if there is any pattern $p \sqsubseteq u$, we have
automatically $p \sqsubseteq u \sqsubseteq l$ and hence $p \notin Sol(a \wedge m)$. So, such a $u$ is redundant.
It is thus pruned in line 11. On the other hand, an $l \in L$ is redundant if for
all patterns $p \sqsubseteq l$, there is no $u \in U$ such that $p \sqsubseteq u$. This is equivalent to
$mgg(l, u) = \{\omega\}\ \forall u \in U$^. So, these elements of $L$ are pruned in line 12 above.

^ Finiteness can be guaranteed by imposing a finite upper bound on the length of the
sequences in $\mathcal{L}$.

Regardless of whether one applies the level-wise version space algorithm or 
the bi-directional MaxMiner approach, it is crucial for efficiency reasons, to em- 
ploy a so-called optimal refinement operator. This is elaborated in the next 
section. 



Algorithm 1 MineSeqLog
1: /* Input: */
2: /* a = an anti-monotonic constraint */
3: /* m = a monotonic constraint */
4: /* Output: */
5: /* U = the set of maximal patterns satisfying a ∧ m */
6: /* L = the set of minimal patterns not satisfying a ∧ m */
7:
8: compute U = S(a), the set of maximally specific patterns satisfying a, using
FindMaximalPatterns.
9: compute L = S(¬m) using FindMaximalPatterns; the set of maximally specific
patterns not satisfying m.
10: /* prune U and L: */
11: Prune away all u ∈ U satisfying: ∃l ∈ L, u ⊑ l.
12: Prune away all l ∈ L satisfying: ∀u ∈ U, mgg(l, u) = {ω}, where mgg(x, y) signifies
the set of the minimally general generalizations of two patterns.


4 An Optimal Refinement Operator for SeqLog 

4.1 Optimality 

A refinement operator is an operator $\rho$ that maps each pattern $p$ to a set of
specializations of it, i.e. $\rho(p) \subseteq \{p' \in \mathcal{L} \mid p \sqsubseteq p'\}$. With such an operator, we can
then employ e.g. level-wise algorithms to generate and test patterns that satisfy
the anti-monotonic constraints.

For optimality, we further require that: 

Completeness. Applying the operator $\rho$ on $p$ (possibly repeatedly), it is
possible to generate all other patterns that $p$ subsumes. In other words,
$\rho^{*}(p) = \{p' \in \mathcal{L} \mid p \sqsubseteq p'\}$. This requirement guarantees that we will not
miss any patterns that may satisfy the constraints.

Single Path. Given pattern $p$, there should exist exactly one sequence of
patterns $p_0 = \omega, p_1, \ldots, p_n = p$ such that $p_{i+1} \in \rho(p_i)$ for all $i$. This requirement
helps ensure that no query is generated more than once, i.e., there are no
duplicates.



^ Here, we assume that $\omega \notin Sol(a \wedge m)$. If $\omega \in Sol(a \wedge m)$, then we would have
$a(\omega) \wedge m(\omega) \Rightarrow m(\omega) \Rightarrow \forall \varphi \in \mathcal{L} : m(\varphi)$ by the monotonicity. This would
mean that $m$ is a trivial predicate that is always true and hence $Sol(a \wedge m)$ would
degenerate to $Sol(a)$, for which $L$ is simply the empty set.
When working with propositional patterns, such an optimal refinement
operator is rather straightforward to devise. However, when working with first order
expressions, optimal refinement operators may not always exist [18].
Furthermore, the use of naive (non-optimal) refinement operators may lead to severe
efficiency problems, cf. the work by Nijssen and Kok [19], who elegantly solved
many of the efficiency problems with the Warmr system [20].

Therefore, in the remainder of this section, we elaborate on the optimal re- 
finement operator that we have developed for functor-free SeqLog. 

4.2 Basic Ideas 

Four operations are identified for refining a certain SeqLog query $Q_g$ into a more
specific one $Q_s$. More precisely, $Q_g \sqsubseteq Q_s$. In other words, the $Q_s$ so generated is
the most general specialization of $Q_g$. We will use the query $Q_g = a(X) < b(Y)$
as an example.

Lengthening. Add a new atom to the query, thus increasing the length of the
query by one literal. The new term can be inserted into any position with
any operator. E.g. $Q_s = a(X) < b(Y) < c(Z)$ or $Q_s = a(X)\ c(Z) < b(Y)$.

Promotion. If the query has a $<$ operator, replace it with $\lhd$. For example,
$Q_s = a(X)\ b(Y)$.

UniVar. Pick any two variables from the query and unify them. For instance,
$Q_s = a(X) < b(X)$.

Instantiation. Pick any variable from the query and replace it with a term.
E.g. $Q_s = a(f(U,V)) < b(Y)$.

Note that for Lengthening and Instantiation, we need to introduce new terms
to form the refined query $Q_s$. Where shall we draw these from? In practice, we
restrict the set of all possible SeqLog queries by specifying a list of predicate names,
together with their arities, which may appear in the generated sequences. Terms
for predicates not appearing in this list are not generated by our refinement
operator.

Furthermore, the operator UniVar in the above form causes problems in
practice. With $n$ variables in $Q_g$, there are $\binom{n}{2} = O(n^2)$ possible pairs to unify. With
a large number of possible queries, this becomes intolerable. Therefore, in our
MineSeqLog system, we allow the user to specify types for the arguments of the
terms as in many ILP systems, e.g. Warmr [20]. Arguments of different types will
never be unified. This helps prevent the generation of nonsense queries such
as $Q_s$ = “cook(Person1, Food) < eat(Food, Food)” formed by unifying variables
“Food” with “Person2” from $Q_g$ = “cook(Person1, Food) < eat(Person2, Food)”.
Types can also be used in MineSeqLog to restrict the terms that are used for
the Instantiation of variables, so that only meaningful values will be used for the
substitution.
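
The effect of typing on UniVar can be illustrated with a few lines of Python. The type names and the function candidate_pairs are hypothetical and serve only to show how the candidate pairs shrink for the cook/eat query.

```python
from itertools import combinations

def candidate_pairs(variables, types=None):
    """Pairs of distinct variables that UniVar may unify.

    Without types every pair is a candidate (n choose 2 of them);
    with types only variables of the same type are paired.
    """
    pairs = list(combinations(variables, 2))
    if types is None:
        return pairs
    return [(x, y) for x, y in pairs if types[x] == types[y]]

# Distinct variables of "cook(Person1, Food) < eat(Person2, Food)".
variables = ["Person1", "Food", "Person2"]
types = {"Person1": "person", "Person2": "person", "Food": "food"}

print(len(candidate_pairs(variables)))    # 3
print(candidate_pairs(variables, types))  # [('Person1', 'Person2')]
```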

With these four basic operations, we can create a refinement operator that
takes a query $Q_g$ and applies each operation (in multiple possible ways) to
generate new queries $Q_s$. Such a refinement operator will satisfy the completeness
criterion. However, it generates a lot of duplicates and hence does not satisfy the
single path criterion. For instance, the specialization a < b < c may be generated
from a < b, a < c or b < c.

We define a measure, the refinement level vector, on each query. The measure 
helps us define restrictions on the above refinement operations to ensure that no 
duplicates are generated. 

4.3 Refinement Level Vector 

Given any functor-free SeqLog query $q$, the refinement level vector $v(q)$ is defined
as a 4-dimensional vector of the form $(l, p, u, i)$, where:

- $l$ is the number of predicates in $q$.

- $p$ is the number of $\lhd$ operators in $q$.

- $u$ is the number of arguments in $q$ minus the number of distinct variables (but
not constant arguments) in $q$ minus the number of constant arguments in $q$.

- $i$ is the number of constant arguments in $q$.

Furthermore, we call the first norm of this vector, $\|v(q)\|_1 = l + p + u + i$,
the refinement level of $q$. Our goal is to make sure that given a query $q$, any of
the four refinement operations on $q$ will always generate a new query $q'$ so that

$$\|v(q')\|_1 = \|v(q)\|_1 + 1.$$

The refinement level vector $v(q)$ for any given query $q$ has the following
properties:

- $v(q) = (0, 0, 0, 0)$ if and only if $q$ is the empty query.

- All four components of $v(q)$ are non-negative integers, for all valid
queries $q$.
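
The refinement level vector is easy to compute once a representation is fixed. The Python sketch below assumes, for illustration only, that an atom is a tuple (predicate, arg1, ...), that variables start with a capital letter, and that operators are stored in a list with the hypothetical encodings DIRECT and AFTER; it checks the example query and one application of each operation, using a constant c for Instantiation since the vector is defined for the functor-free case.

```python
DIRECT, AFTER = "<|", "<"   # assumed encodings of the two operators

def is_var(term):
    return term[:1].isupper()

def level_vector(atoms, ops):
    """Return (l, p, u, i) for a functor-free query."""
    args = [t for atom in atoms for t in atom[1:]]
    distinct_vars = {t for t in args if is_var(t)}
    i = sum(1 for t in args if not is_var(t))    # constant arguments
    l = len(atoms)                               # number of atoms
    p = ops.count(DIRECT)                        # direct-successor operators
    u = len(args) - len(distinct_vars) - i       # unifications performed so far
    return (l, p, u, i)

def level(atoms, ops):
    """First norm of the refinement level vector."""
    return sum(level_vector(atoms, ops))

q_g = ([("a", "X"), ("b", "Y")], [AFTER])                             # a(X) < b(Y)
lengthened = ([("a", "X"), ("b", "Y"), ("c", "Z")], [AFTER, AFTER])   # ... < c(Z)
promoted = ([("a", "X"), ("b", "Y")], [DIRECT])                       # a(X) b(Y)
unified = ([("a", "X"), ("b", "X")], [AFTER])                         # a(X) < b(X)
instantiated = ([("a", "c"), ("b", "Y")], [AFTER])                    # a(c) < b(Y)

print(level_vector(*q_g))            # (2, 0, 0, 0)
print(level_vector(*unified))        # (2, 0, 1, 0)
print(level_vector(*instantiated))   # (2, 0, 0, 1)
```

Each of the four refined queries has refinement level 3, one more than the level 2 of the example query.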



4.4 Duplicate Avoidance 

In order to avoid the generation of duplicates and to satisfy the single-path
criterion, we need to add restrictions to our 4 refinement operations. Because
of space limitations, we can only briefly present the restrictions in this paper.
Below, $q$ denotes the query being refined and $q'$ denotes a refined query. The
symbols $l, p, i, u$ denote any non-negative integers.

Lengthening. Apply only if $v(q) = (l, 0, 0, 0)$. Moreover, new atoms are only
added at the end with the $<$ operator. Result: $v(q') = (l + 1, 0, 0, 0)$.

Promotion. Apply only if $v(q) = (l, p, 0, 0)$. When promoting an operator from
$<$ to $\lhd$, only do so when this operator is followed by no other $\lhd$ operators.
This is required to avoid duplication. For example, “$a \lhd b \lhd c$” could be
generated from both “$a \lhd b < c$” and “$a < b \lhd c$” by Promotion. However, the above
restriction forbids the latter, and only allows “$a \lhd b \lhd c$” to be promoted from
“$a \lhd b < c$”. Result: $v(q') = (l, p + 1, 0, 0)$.

UniVar. Apply for all $v(q) = (l, p, u, i)$. Of the two variables chosen to be
unified, one of them must be not yet unified with any other variables. Moreover,
this variable must not be followed by any other already unified variables.
These constraints are needed to avoid duplicates as illustrated in Fig. 1,
which illustrates the possible paths through which a query $f(X, X', Y, Y')$
can be refined to $f(X, X, X, X)$ using UniVar. The above restrictions prune
away all paths in dashed lines, leaving a unique spanning tree rooted at
query $f(X, X', Y, Y')$. Result: $v(q') = (l, p, u + 1, i)$.




Instantiation. Apply only if v{q) = {l,p,0,i). Moreover, in q no other argu- 
ments to the right of the variable to be instantiated should be the result 
of a previous Instantiation, i.e. successive instantiations are performed from 
left to right. This restriction is required to avoid duplicates as illustrated in 
Fig. 2. The figure illustrates the different possible paths to refine f{A, B, C) 
to /(a, b, c). The restriction prunes away the paths shown with dashed lines, 
leaving a spanning tree rooted at f{A,B,C). Result: v{q') = {l,p,0,i + !)• 







Fig. 2. Example of duplicate avoidance for Instantiation (DataLog case)

Note that each of these four operations increases exactly one coordinate of the
refinement level vector by one. Each of these operations has a corresponding
inverse operation, which decreases the corresponding coordinate in the refinement
level vector by one. Given any valid sequential query, we can apply the inverse
of UniVar successively to generalize it, decreasing its u value by one at a time,
until the u value reaches zero. We can then apply the inverse of Instantiation
to decrease the value of i until it becomes zero. Similarly, we can decrease the p
value to zero by the inverse of Promotion and the value of l to zero by the inverse
of Lengthening. This sequence of inverse operations corresponds to a reversed
sequence of refinement operations that refines the original query from the empty
query. So, the original query can be generated by refining the empty query. This
is true for any valid sequential query. Therefore, any valid sequential query can
be refined from the empty query. Our refinement operator is thus complete,
despite the additional restrictions.

Moreover, the sequence of refinements described above to obtain any query
from the empty query is also unique, because of the restrictions added to the 4
refinement operations. So, the single-path criterion is also met. Thus, our
refinement operator is optimal. As an example, consider the query
“q7 = a(X) b(Y, g) < c(h, X)”, with a refinement level vector v(q7) = (3, 1, 1, 2)
in the order (l, p, u, i). The only way to get this query using our refinement
operators is from “q6 = a(X1) b(Y, g) < c(h, X2)” by UniVar, where
v(q6) = (3, 1, 0, 2). This in turn must have come from 2 Instantiations, via
“q5 = a(X1) b(Y, g) < c(H, X2)” and “q4 = a(X1) b(Y, G) < c(H, X2)”, where
v(q5) = (3, 1, 0, 1) and v(q4) = (3, 1, 0, 0). The only way to obtain q4 is a
Promotion from “q3 = a(X1) < b(Y, G) < c(H, X2)”, with v(q3) = (3, 0, 0, 0).
This must have come from Lengthening, via “q2 = a(X1) < b(Y, G)”
(where v(q2) = (2, 0, 0, 0)) and “q1 = a(X1)” (where v(q1) = (1, 0, 0, 0)).
Eventually, we get q1 by Lengthening the empty query q0, which has
v(q0) = (0, 0, 0, 0). It is left as an exercise for the reader to check that
q0, q1, . . . , q7 is the only way to obtain q7 from q0 using our optimal
refinement operator repeatedly.
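
The single-path property can be checked mechanically for the Promotion step. The following sketch looks only at the operators of a query (the atoms play no role in this restriction) and counts how many derivation paths lead to each operator pattern. The tuple encoding and the label "adj" for a promoted operator are assumptions made for illustration; UniVar and Instantiation are not modelled.

```python
from collections import Counter


def promotions(ops):
    """Yield every operator tuple reachable from ops by one restricted Promotion."""
    for k, op in enumerate(ops):
        # Promote a "<" only if no promoted operator follows it.
        if op == "<" and "adj" not in ops[k + 1:]:
            yield ops[:k] + ("adj",) + ops[k + 1:]


def count_derivations(n_ops):
    """Count, for each operator pattern, how many Promotion paths lead to it."""
    start = ("<",) * n_ops          # result of Lengthening alone
    paths = Counter({start: 1})
    frontier = [start]
    while frontier:
        next_frontier = []
        for ops in frontier:
            for child in promotions(ops):
                paths[child] += 1   # one more derivation path ends in child
                next_frontier.append(child)
        frontier = next_frontier
    return paths


paths = count_derivations(3)
print(len(paths), set(paths.values()))  # 8 {1}
```

For three operators all 8 patterns are reached, each by exactly one path. Dropping the `"adj" not in ops[k + 1:]` test would make the same patterns reachable along several paths.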

4.5 Further Optimizations 

Using the idea of the apriori_gen function from Apriori, the Lengthening
operation can be further optimized when dealing with anti-monotonic constraints.
We can perform Lengthening on the set Q of all q with v(q) = (l, 0, 0, 0) in batch
and generate the set Q' of all q' with v(q') = (l + 1, 0, 0, 0). We first sort the
elements of Q in lexicographical order. Then, for every pair of queries q ∈ Q
sharing a common length l − 1 prefix, join them to form 4 length l + 1 candidate
queries, e.g. the pair “a < b < c” and “a < b < d” produces “a < b < c < c”,
“a < b < c < d”, “a < b < d < c” and “a < b < d < d”. All such generated queries
are collected and inserted into the set Q'. The anti-monotonic property guarantees
that all suitable SeqLog queries of length l + 1 are found inside Q'. Moreover, this
generation method never generates any query q' with v(q') = (l + 1, 0, 0, 0) more
than once.
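
A possible reading of this join step is sketched below. It assumes that a query consisting only of < operators is encoded as a tuple of atoms, and it generates each ordered pair of last atoms exactly once per prefix group, so that self-joins such as “a < b < c < c” are not repeated. The function name is hypothetical, and the subsequent pruning of candidates is left out.

```python
from itertools import groupby, product


def join_lengthen(queries):
    """Join length-l queries sharing a length-(l-1) prefix into length-(l+1) candidates.

    Each query is a non-empty tuple of atoms; consecutive atoms are understood
    to be connected by "<".
    """
    candidates = []
    ordered = sorted(set(queries))                    # lexicographical order
    for prefix, group in groupby(ordered, key=lambda q: q[:-1]):
        last_atoms = [q[-1] for q in group]
        # Every ordered pair of last atoms (including x == y) is produced once.
        for x, y in product(last_atoms, repeat=2):
            candidates.append(prefix + (x, y))
    return candidates


for q in join_lengthen([("a", "b", "c"), ("a", "b", "d")]):
    print(" < ".join(q))
```

Grouping the sorted queries by prefix is what keeps the output free of duplicates: two different prefix groups can never produce the same candidate.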

Another optimization when dealing with only anti-monotonic constraints is
to consider only those atoms which, when treated as singleton sequences, satisfy
the constraints. This is because sequences containing any other atoms will not
satisfy the constraints. (If they did, then so would the singleton sequences formed
from these atoms because of the anti-monotonicity, which is a contradiction.) E.g.,
when finding frequent itemsets, only frequent items need to be added when
generating candidate itemsets of larger sizes.



5 The MineSeqLog Algorithm 

Having an optimal refinement operator, we can now devise an efficient algorithm 
for solving the data mining problem specified in Sect. 3. We call this algorithm 
MineSeqLog. 

The inputs to the algorithm are two constraints: a and m, where a is anti- 
monotonic and m is monotonic. These constraints are predicates on SeqLog 
queries against a given database D (or a particular subset of it). The algorithm 
then discovers all SeqLog queries that satisfy the given constraints. The algo- 
rithm also takes a set F as input. This specifies the set of predicate names (with 
the specified arities) to be used to generate the terms in the SeqLog patterns. 

The MineSeqLog algorithm is already given in Sect. 3.2, and is thus not
repeated here. There, we mentioned a FindMaximalPatterns algorithm, which is
listed in Algorithm 2. Most of the work of MineSeqLog is delegated to this
algorithm. This algorithm requires that the input constraint a be anti-monotonic. It
is an iterative level-wise discovery algorithm based on the same framework as the
well-known Apriori algorithm [7]. In iteration k, FindMaximalPatterns generates
a candidate set of potential patterns and then tests them against the given
anti-monotonic constraint a, possibly scanning the involved database subsets.
Those that satisfy the constraint are added to L_k. C_1 is the set of interesting
atoms generated from F. In subsequent iterations, C_k is generated by applying
the GenerateCandidates algorithm described later. This latter algorithm takes a
set of patterns of the same refinement level and uses the optimal refinement
operator described in Sect. 4 to generate the patterns of the next refinement level.
For efficiency reasons, we keep the refinement level vector v(q) together with each
pattern q in C_k as well as L_k. We will later see that we can compute the
refinement level vectors easily without calculating them from scratch in each iteration.

The GenerateCandidates algorithm (Algorithm 3) takes as input a set of 
interesting patterns of the same refinement level (i.e. the sum of the coordi- 
nates in the refinement level vector, see Sect. 4.3) and computes their optimal 
refinements. The refined patterns are all of a refinement level one higher than 
that of the input patterns. For efficiency, the input and output sets are actually 
ordered pairs containing the pattern as well as its refinement level vector.
GenerateCandidates can compute the refinement level vectors of the newly generated
candidates immediately, without having to use the definition given in Sect. 4.3.

The GenerateCandidates algorithm given here does not apply the Lengthening
operation (Sect. 4.2) directly. Rather, it employs the optimization described
in Sect. 4.5. This uses the join operation in Apriori to efficiently compute the
lengthened candidates.

Algorithm 2 FindMaximalPatterns
/* Input: */
/*   a = an anti-monotonic constraint */
/*   D = the database to be mined */
/*   F = a set of predicate names with arities */
/* Output: S = the set of maximal patterns satisfying a */

/* First Iteration: */
C_1 ← {(f(X_1, X_2, . . . , X_n), (1, 0, 0, 0)) | (f, n) ∈ F}
      /* (1, 0, 0, 0) is the refinement level vector */
L_1 ← {(x, v) ∈ C_1 | pattern x satisfies constraint a}
k ← 2
loop   /* k-th Iteration: */
    C_k ← GenerateCandidates(L_{k-1})
    if C_k = ∅ then
        exit loop
    end if
    L_k ← {(x, v) ∈ C_k | pattern x satisfies constraint a}
    k ← k + 1
end loop

/* Results: */
return S ← max(⋃_k L_k)
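
The level-wise loop can be sketched in Python as follows. This is only an illustration of the control flow under stated assumptions: the constraint, the candidate generator and the generality test are passed in as functions (all parameter names are hypothetical), refinement level vectors are omitted, and "maximal" is read as "not strictly more general than another discovered pattern".

```python
def find_maximal_patterns(initial, satisfies, generate_candidates, is_more_general):
    """Level-wise search; returns the maximal patterns satisfying the constraint."""
    level = [x for x in initial if satisfies(x)]            # L_1
    found = list(level)
    while level:
        candidates = generate_candidates(level)             # C_k from L_{k-1}
        if not candidates:
            break
        level = [x for x in candidates if satisfies(x)]     # L_k
        found.extend(level)
    # Keep the patterns that are not strictly more general than another one.
    return [x for x in found
            if not any(x != y and is_more_general(x, y) for y in found)]


# Toy instance: patterns are strings, and a pattern is "frequent" if it is a
# substring of at least two database strings.
database = ["abc", "abd", "xab"]
alphabet = "abcdx"

result = find_maximal_patterns(
    initial=list(alphabet),
    satisfies=lambda x: sum(x in s for s in database) >= 2,
    generate_candidates=lambda level: [x + c for x in level for c in alphabet],
    is_more_general=lambda x, y: x in y,
)
print(result)  # ['ab']
```

In the toy run the frequent patterns are "a", "b" and "ab"; only "ab" is maximal because the other two are substrings of it.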



Algorithm 3 GenerateCandidates
/* Input: */
/*   L_{k-1} = the set of interesting patterns discovered at the previous level */
/* Output: C_k = the set of new candidates */

F_{k-1} ← {x | (x, v) ∈ L_{k-1} ∧ v = (k − 1, 0, 0, 0)}
F_k ← Join(F_{k-1})   /* Using the idea of Apriori’s join */
C_k ← {(x, (k, 0, 0, 0)) | x ∈ F_k}
for all (x, v) ∈ L_{k-1} do
    if v has the form (l, p, 0, 0) then   /* candidate for promotion */
        apply the Promotion operation on x to get y
        if successful, C_k ← C_k ∪ {(y, (l, p + 1, 0, 0))}
    end if
    if v has the form (l, p, 0, i) then   /* candidate for instantiation */
        apply the Instantiation operation on x to get y
        if successful, C_k ← C_k ∪ {(y, (l, p, 0, i + 1))}
    end if
    if v has the form (l, p, u, i) then   /* candidate for univar */
        apply the UniVar operation on x to get y
        if successful, C_k ← C_k ∪ {(y, (l, p, u + 1, i))}
    end if
end for
return C_k



6 Experiments

We have implemented the algorithms in Sect. 5 in SICStus Prolog and
performed some preliminary experiments on a Unix command history database
[21,8] to test out our ideas. The implementation uses the refinement operator
in Sect. 4 and hence discovers functor-free SeqLog queries. The database is a
set of simple (i.e. without the < operator) SeqLog facts, although the patterns
discovered are complex sequences (which may contain the < operator).

6.1 The Unix Command History Data 

The database is obtained from 168 users of Unix csh. The Unix commands that
they used over a period of time were recorded. These are represented as a set
of simple SeqLog facts. Each command is represented as an atom with exactly 2
parameters: the command arguments and the directory from which the command
is invoked. The sequence of commands used by a single user in a single login
session is represented as a SeqLog sentence. For example, if user Mary logs in
and invokes the commands cd myprog, ls *.c, mail, exit, this login session is
represented by the SeqLog sequence:

    cd('myprog', '/home/mary')
    ls('*.c', '/home/mary/myprog')
    mail('', '/home/mary/myprog')
    exit('', '/home/mary/myprog')
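
The encoding of a login session can be sketched as follows. The function name, the input format (a list of command/argument pairs) and the handling of the working directory are assumptions made for illustration; the preprocessing actually used for the data set is not described here.

```python
import posixpath


def session_to_seqlog(commands, start_dir):
    """Turn (command, arguments) pairs into SeqLog atoms of arity 2.

    The second parameter of each atom is the directory from which the
    command is invoked, so a cd only affects the commands that follow it.
    """
    cwd = start_dir
    atoms = []
    for name, args in commands:
        atoms.append(f"{name}('{args}', '{cwd}')")
        if name == "cd" and args:
            cwd = posixpath.normpath(posixpath.join(cwd, args))
    return atoms


session = [("cd", "myprog"), ("ls", "*.c"), ("mail", ""), ("exit", "")]
for atom in session_to_seqlog(session, "/home/mary"):
    print(atom)
```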

A large number of login sessions of different users are gathered. These users 
are divided into four groups: novice programmers (nov), experienced program- 
mers (exp), non-programmers (non) as well as computer scientists (sci). This 
allows us to partition the database into 4 disjoint subsets. Table 1 shows a sum- 
mary of some measures on the data. 



Table 1. Summary statistics of the data

Subset (D)   users   sequences   threshold (θ)   time (s)   patterns   |M|
nov          55      5164        80              676        796        38
exp          36      3859        100             604        834        61
non          25      1906        200             890        533        100
sci          52      7751        250             2345       208        74



6.2 The Mining Task 

The anti-monotonic constraint we used in the experiment is frequent(p, D; θ),
where p represents the pattern, i.e. SeqLog query, in question and D represents
the database subset (one of {nov, exp, non, sci}). It evaluates to true if the
frequency of pattern p in the database subset D is no less than θ (see Sect. 3.1).
For simplicity, the monotonic constraints that we used are simply the negation
of “frequent”. Moreover, we have assigned a threshold value to each data subset
and always used that threshold for it. These values are shown in Table 1. Many
different threshold values have been attempted and we present here those that
yield a moderate number of patterns. The database is preprocessed to obtain the
150 most frequent commands. These, with an arity of 2, form the set F. Thus,
the mining algorithm will only generate patterns out of the atoms corresponding
to these commands. The results of applying FindMaximalPatterns to these data
sets are also shown in Table 1. The “time” column shows the wall-clock time
taken for the algorithm to complete on a Pentium III (600 MHz) PC running
Linux kernel 2.4. The “patterns” column shows the number of patterns found to
satisfy the given constraints. The “|M|” column shows the number of patterns
in the set of maximal patterns.

We then used our MineSeqLog program to discover interesting patterns from
the database. For each pair of distinct database subsets $D_1$ and $D_2$, we denote
by $D_1$:$D_2$ the constraint obtained from the conjunction:

$\mathrm{frequent}(p, D_1; \theta_{D_1}) \land \lnot\,\mathrm{frequent}(p, D_2; \theta_{D_2})$

e.g. when $D_1$ is ‘nov’ and $D_2$ is ‘exp’, this constraint will cause MineSeqLog
to discover command sequences that are often used by novice programmers but
only seldom used by experienced programmers. Each such constraint is then
specified to MineSeqLog, which then discovers all interesting patterns satisfying
the constraint. The results are summarized in Table 2. In each case, the table
shows the number of patterns found and the sizes of the sets U and L (see
Sect. 3.2) before and after pruning. The length of the longest sequence was 4 in
all cases. At this point the reader may have noticed that the size of U ∪ L can
be larger than the number of solutions. For version spaces, the size of S ∪ G can
be at most twice the number of solutions. At present we are still investigating
this issue for U and L as well as whether more elements from U and L could be
pruned.
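
The conjunction can be written down directly as a predicate. The sketch below assumes that the frequency of a pattern is the number of sequences in the subset that it matches (the definition in Sect. 3.1 is not repeated here) and takes the matching function as a parameter. The subsequence matcher and the two tiny data sets are purely illustrative.

```python
def frequency(pattern, sequences, matches):
    """Number of sequences matched by the pattern (assumed frequency measure)."""
    return sum(1 for s in sequences if matches(pattern, s))


def contrast_constraint(pattern, d1, theta1, d2, theta2, matches):
    """frequent(p, D1; theta1) and not frequent(p, D2; theta2)."""
    return (frequency(pattern, d1, matches) >= theta1
            and not frequency(pattern, d2, matches) >= theta2)


def is_subsequence(pattern, sequence):
    """Illustrative matcher: pattern items occur in order, gaps allowed."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


nov = [["cd", "ls", "more"], ["ls", "more"], ["mail"]]
exp = [["make", "ls"], ["make"]]
print(contrast_constraint(("ls", "more"), nov, 2, exp, 1, is_subsequence))  # True
print(contrast_constraint(("make",), nov, 2, exp, 1, is_subsequence))       # False
```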

Table 2. Results of running MineSeqLog

             number of        before pruning           after pruning
Constraint   patterns found   |U|   |L|   |U|+|L|      |U|   |L|   |U|+|L|
nov:non      679              38    100   138          37    29    66
nov:exp      401              38    61    99           27    41    68
nov:sci      716              38    74    112          33    62    95
exp:nov      609              61    38    99           35    36    71
exp:non      686              61    100   161          58    100   158
exp:sci      756              61    74    135          51    68    119
non:nov      489              100   38    138          93    13    106
non:exp      479              100   61    161          93    27    120
non:sci      501              100   74    174          98    61    159
sci:nov      161              74    38    112          68    30    98
sci:non      193              74    100   174          71    95    166
sci:exp      160              74    61    135          63    49    112

6.3 Experimental Results

Some interesting example patterns in the results include:

– The sequences discovered in non:exp formed a subset of those in non:sci and
non:nov. (E.g. the sequence “nroff(A,B) < wordproc(C,D)” is common to
all these result sets.) The sequences discovered in nov:exp formed a subset of
those in nov:non. (E.g. “more(A,B) more(C,D) < more(E,F) < more(G,H)”
is common to both.) The result of sci:exp is also a subset of sci:non. (E.g. they
both contain “more(A,B) < more(C,D) < more(E,D)”.) Perhaps experienced
programmers tend to use a larger set of command sequences, so that there
are fewer sequences that are infrequent in their data set.

– The sequences in non:sci and sci:non are disjoint. (E.g. in non:sci, the
following pattern is found: “emacs(A,B) < wordproc(C,B) < emacs(D,B)
< wordproc(E,B)”. It does not occur in sci:non. However, sci:non contains
“mailnews(A,B) < mailnews(A,B) < mailnews(A,B)”, which is not found
in non:sci.) This suggests that non-programmers use command sequences
very different from those of computer scientists.

– The sequences in nov:exp and exp:nov are also disjoint. (E.g. “script(A,B)
< script(A,C) < script(A,B) < script(A,B)” is found in nov:exp but
not in exp:nov. On the other hand, “make(A,B) < make(C,B) < make(C,B)
< make(C,B)” is found in exp:nov but not in nov:exp.) Maybe novice and
experienced programmers tend to use very different command sequences
(e.g. experienced programmers tend to master the make utility).



7 Related Work 

There is plenty of related work to MineSeqLog. First, there are the already men- 
tioned propositional approaches to mining sequential data. Secondly, there is the 
approach by Mannila and Toivonen [9] that considers a form of relational sequen- 
tial patterns in which there is only one type of event, no background knowledge, 
the order must be expressed with a total order relation on a special time at- 
tribute and instead of unification, equality must be expressed with an explicit 
relation. Thirdly, there is the work by Masson et al. [22], who also investigate
logical sequential patterns. However, Masson et al. extend the Spirit system [4],
which employs a grammar as a syntactic bias mechanism to mine propositional
sequences. Masson’s model cannot express the equivalent of our < operator, cannot express
background knowledge in the form of clauses, and is limited to minimum fre- 
quency constraints only. Fourthly, there is the traditional work on frequent query 
discovery in multi-relational data mining and inductive logic programming, e.g. 
[20,23]. Although one could in principle also use these techniques to find sequen- 
tial patterns, it would be less efficient to do so. Indeed, the typical inductive logic 
programming approach to representing sequential data consists of adding to each 
fact a number (that denotes the element in the sequence). If one then wants to 
express that one fact comes after the other, one must add predicates of the type 
successor or after together with the relevant arguments. This leads to a larger
search space and also does not exploit possible optimizations for matching (e.g.
s-subsumption is polynomial). Finally, the work is also related to constraint
based data mining. In a sense it was motivated by our earlier work on mining for 
linear (or sequential) fragments within molecules in the MolFea system [11]. The 
present approach extends the MolFea framework with more expressive fragments 
but retains the powerful primitives that were present in MolFea. 



8 Conclusions 

We have introduced a logical language for representing and reasoning about 
sequential data. Other researchers (such as Bonner [24]) have also introduced 
languages for processing sequences based on computational logic. However, the 
main contribution of SeqLog is that the notions of subsumption, entailment and
a fixpoint semantics are given. These act as direct analogues of the corresponding
notions in Prolog. This is important because subsumption and entailment
are central to inductive logic programming. As a consequence, mining SeqLog 
sequences will be analogous to inductive logic programming. 

The optimal refinement operator (see Sect. 4) discussed in this paper is re- 
stricted to functor-free SeqLog. One future direction is to generalize this re- 
finement operator to work for general SeqLog, by allowing Instantiation with 
general terms instead of just constant terms. Another future direction is con- 
cerned with the use of SeqLog as a representation model for other mining and 
learning tasks. In this context, our lab has already explored the upgrading of 
Hidden Markov Models for use with SeqLog, cf. [10], and we are now interested 
in predicate learning and distance based learning within the SeqLog language. It 
is our hope that the SeqLog framework will inspire further research in databases, 
data mining and inductive logic programming.



Abstract. The conflict between resource consumption and query performance
in the data mining context often has no satisfactory solution.
This is in sharp contrast to the needs of the analysts for interactive
response times and has rendered the seamless integration of data mining
operators into common multiuser database systems a difficult and (so far)
not very successful task. This paper describes an approach that allows
preprocessing and data mining operators to be combined into one common
KDD-aware implementation algebra such that interactivity, scalability
and resource efficiency can simultaneously be achieved. The basic idea
of our framework is pipelining. However, since there is a danger of blocking
pipelines, we introduce controlled ordering, cardinality and special-value
properties of the data stream across the whole query tree up to
the complex data mining operators. The framework builds on a specialized
index that is basically an extension of the UB-Tree and efficiently
provides various data orderings. These orderings and the remaining
properties are then exploited by the KDD-algebra operators to release results
and internal data structures early enough to allow pipelined, resource-efficient
query processing with interactive response times. This paper
describes the framework and demonstrates its benefits in preprocessing
and in the parallel and interactive detection of outliers.



1 Introduction 

Decision support applications and especially data mining tasks are inconceivable 
without database support. In order to achieve the necessary scalability with 
today’s massive datasets, database systems split query processing into successive 
tasks, process one operator after the other and swap intermediate results to disk. 
Performance optimization is achieved by clever organization of the data on disk, 
proper design of algorithms and operator ordering. However, these techniques 
are insufficient in a KDD environment which is explorative in nature and where
the analyst needs to continuously reevaluate the results and refine the query.

Techniques such as the exploitation of access patterns in order to support the
most common queries, e.g., by one-dimensional indexes delivering a special sort
order efficiently, or data partitioning in order to allow for data parallel processing
in a parallel environment, appear problematic: in explorative processes it is not
known in advance which index will be useful or which partitioning scheme will
allow for well-balanced data parallelism. At the very least, optimized KDD
support needs parallelization schemes and index structures that are less vulnerable
to changing access patterns.

Further, analysts in data mining must take decisions on how to proceed while
the query is still being processed. They cannot wait until the last operator has
finished but need to interact in a more timely fashion, perhaps to start a new
iteration cycle as soon as the first signs of a wrong choice of parameters have been
detected. The conclusion from this observation is that optimized KDD support
needs an operator scheduling strategy that achieves scalability under the
condition of interactivity.

In this paper, we develop techniques that make database support for large- 
scale query processing in KDD resource-efficient, scalable and interactive. The 
central idea is to provide interactivity through the extensive use of pipelining. To 
achieve blocking-free and balanced pipelining, we split the operators into prim- 
itives and eliminate blocking obstacles by exploiting ordering properties and 
cardinality information of the data stream flowing through the query tree. To 
establish these properties efficiently, we provide a multidimensional index struc- 
ture providing several data orderings that can be used by each operator of our 
KDD query algebra to compute final results on partial data, and to release these 
to the next operator and finally, at the end of the operator chain, to the human 
analyst. At the same time the internal data structures of the operators can be 
cleared so that resource limits can be met without the need for swapping data to 
peripheral storage. This allows for interactive query processing with controllable 
resources and especially for non-blocking pipelining in parallel environments. 

The paper is organized as follows. After a survey of related work, we analyze 
in Section 3 common implementations of typical KDD preprocessing operators 
to determine which properties of the processed data stream are useful to achieve 
our goals. In Section 4, we will describe the concept of data stream quality, 
which formalizes these properties. Section 5 shows how such qualities can be 
provided efficiently. After a short demonstration of the benefits that our frame- 
work provides to query processing, we describe in Section 6 how interactive data 
mining algorithms can benefit from this framework to allow for parallel and in- 
teractive query processing in KDD. A summary, a survey of open problems and 
suggestions for further work conclude the paper. 



2 Related Work 

In recent years, efforts towards database support for KDD concentrated on the
development of special languages to express data mining queries. Starting with
the query language DMQL [5] and continuing all the way to the formulation
of a sophisticated abstract data mining algebra in [9], there have been many
attempts to design languages for data mining queries. What is still missing is
the integration of KDD with database system kernels at the implementation
level. Despite the lack of integration there have been several attempts to
improve interactivity by delivering early results in query processing. We will now
study these approaches and discuss how well they combine interactivity with
the pipelining requirements needed to achieve scalability and resource efficiency
within the complex analysis chains in KDD.

From the early beginnings of System R, the idea of improving performance 
by pipelining and achieving it through the exploitation of ordering properties 
within the data stream has been incorporated into database optimizers [16]. Its 
success rests on the existence of frequently used sort orders which are supported 
by suitable one-dimensional indices, an issue quite difficult to enforce in KDD 
as mentioned before. 

Early results without requiring special data stream properties at the input
can only be obtained by more complex operators. The first representative of this
type was the pipelined hash join [20]. The work in [7] extends it with out-of-core
techniques to limit its massive resource consumption. The paper proposes
to swap parts of the hash tables to disk and defines two processing stages: a
regular stage that works similarly to the original algorithm, and a clean-up stage
where the portions of the hash table swapped to disk are processed. The work
in [18] extends this with a third, reactive phase, which is activated when regular
input is blocked. Overall these various phases lead to complex pipeline scheduling
problems, which are solved by the specialized scheduling scheme in [19]. A similar
goal, to provide early join results, is described in [3] for sort-based joins: the authors
add early joins for the data that is in memory during the initial sort and merge
steps in order to deliver early results. The overall execution time grows slightly
because of the more complex internal join processing and, in contrast to the
original sort-merge join, the data being delivered has no specific order. What is
more, the rate at which results are delivered varies widely during runtime.

The single-operator solutions described so far have in common that they do
not guarantee specific qualities of the data delivered so far, although these could
be useful for subsequent operators such as GROUPBY, and that they produce results
at widely varying rates, making well-balanced pipelines difficult to maintain.

A different approach to achieve interactivity in the presence of incomplete 
results is online query processing, where interactivity is traded for accuracy. The 
goal is to provide early results that reflect the trends within the data, and do 
so with an increasing level of confidence as query processing progresses. The 
most prominent operator in this context is the ripple join [4], which provides 
the possibility to adjust the rate at which the inputs are processed to allow the 
user to favor the “most interesting” input. [15] can support this approach by 
an efficient best-effort-reordering in specific cases (without guarantees that all 
relevant data elements have been provided) but this does not solve the scalability 
problem. What is more, long sequences of (inaccurate) preprocessing operators 
like reorderers, joins and aggregations in an analysis chain cannot provide enough 
accuracy for subsequent data mining algorithms.

In a parallel environment, where classical data parallelism leads to scalability
problems in case of data skew, and limits interactivity by splitting query
processing into several tasks, pipelining seems to be the only solution in order to
achieve interactivity. Because of its bandwidth requirements, pipelining was seen
mainly as a technique for shared memory (SMP) machines, and even there the
problems were regarded as particularly serious [2]. Nevertheless, there has been
substantial work in the area of pipelining implementations of complex
operators [20], cost models [21,17] and optimization strategies [11]. But the problem
that especially the complex operators block the pipeline and therefore destroy
interactivity and scalability still remains unsolved.



3 The Source of the Problems 



Our main thesis at this point is that if we wish to truly integrate data mining 
and database systems, we should integrate the KDD-specific operators into the 
query algebra of database systems. To do so our first task is to analyze the 
common preprocessing operators used in KDD to determine the reasons for 
resource consumption and for the lack of interactivity due to blocking. As seen 
in Figure 1, this includes standard database operators like select, join, etc. as 
well as KDD specific operators (sample, history etc.). For reasons of space, we 
limit the paper to the following four operators: 

– GROUP, which calculates an aggregation function f of tuples coinciding on the
values of attributes in some fixed list T_U (like in the known SQL construct)

– JOIN, which joins two relations using the join attributes T_U and T_V

– SAMPLE, which provides a randomly chosen sample of g tuples from the
relation

– NORMALIZE, which does a linear rescaling of an attribute A so that the values
fit into a given interval [a, b].



Fig. 1. KDD operators in our Algebra (panels: data selection, data preparation,
data abstraction)

To identify the blocking and resource intensive steps, we have to look deeper 
into the implementations of those operators. 

One possible implementation for the GROUP operator is the following
collectGroup, which aggregates the tuples in R according to the attribute list T_U
while applying the aggregation function f:

collectGroup(R, T_U, f)
    1. [collect] Collect tuples with the same value in T_U (e.g. in a
       hash table)
    2. [aggregate] Calculate the aggregated attributes for each
       collected group of tuples



This implementation consists of two steps, a collect step and an aggregation 
step. Note that the second step runs with little resource demand in linear time 
and is not blocking. 
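
The two steps can be made explicit in a short Python sketch. It assumes tuples are dictionaries and that the aggregation function receives the list of tuples of one group. The names are hypothetical and the code only illustrates where the blocking occurs.

```python
from collections import defaultdict


def collect_group(rows, group_attrs, aggregate):
    """Two-step GROUP: a blocking collect step, then a streaming aggregate step."""
    # Step 1 [collect]: blocking, the whole input must be read before any
    # group is known to be complete; memory grows with the input.
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[attr] for attr in group_attrs)
        groups[key].append(row)
    # Step 2 [aggregate]: one pass over the groups; each result can be
    # released as soon as it is computed.
    for key, members in groups.items():
        yield key, aggregate(members)


rows = [{"dept": "a", "x": 1}, {"dept": "b", "x": 5}, {"dept": "a", "x": 2}]
totals = dict(collect_group(rows, ["dept"], lambda ms: sum(m["x"] for m in ms)))
print(totals)  # {('a',): 3, ('b',): 5}
```

If the input were already ordered by the grouping attributes, the hash table could be replaced by a single current group that is released whenever the key changes.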

The sortMergeJoin implementation for the JOIN operator works in a similar 
fashion. It sorts the tuples of both relations according to the join attributes and 
merges them like a zipper: 



sortMergeJoin ($R$, $S$, join attributes $T_U$, $T_V$)

1a. [sort] Sort the first relation $R$ by $T_U$

1b. [sort] Sort the second relation $S$ by $T_V$

2. [merge] Merge the relations



Here, the sorting steps have been separated from the rest. Then, the follow- 
ing merge step can be executed, again with little resource demand, and is not 
blocking. Sorting, however, is blocking. 
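A minimal sketch of this split, assuming relations are lists of dictionaries and a single join attribute on each side (the function and parameter names are illustrative only):

```python
def sort_merge_join(r, s, r_key, s_key):
    """Join two lists of dicts on r[r_key] == s[s_key]."""
    # Steps 1a/1b [sort]: blocking, the full input is needed before any output.
    r_sorted = sorted(r, key=lambda t: t[r_key])
    s_sorted = sorted(s, key=lambda t: t[s_key])
    # Step 2 [merge]: advance two cursors like a zipper.
    i = j = 0
    while i < len(r_sorted) and j < len(s_sorted):
        a, b = r_sorted[i][r_key], s_sorted[j][s_key]
        if a < b:
            i += 1
        elif a > b:
            j += 1
        else:
            # Find the run of S tuples with this key, then pair it with
            # every R tuple carrying the same key (handles duplicates).
            j_end = j
            while j_end < len(s_sorted) and s_sorted[j_end][s_key] == a:
                j_end += 1
            while i < len(r_sorted) and r_sorted[i][r_key] == a:
                for s_row in s_sorted[j:j_end]:
                    yield {**r_sorted[i], **s_row}
                i += 1
            j = j_end


customers = [{"cid": 2, "name": "Bo"}, {"cid": 1, "name": "Al"}]
orders = [{"oid": 10, "cust": 1}, {"oid": 11, "cust": 2}, {"oid": 12, "cust": 1}]
joined = list(sort_merge_join(customers, orders, "cid", "cust"))
```

If both inputs already arrive sorted, the two `sorted` calls become unnecessary and only the streaming merge loop remains.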

If we consider the countPickSample implementation for the SAMPLE or the 
rangeMinmaxNormalize implementation for the NORMALIZE operator we observe 
a similar result: 



countPickSample (Relation $R$, sample size $g$)

1. [count] Determine the number $M$ of tuples

2. [pick] Let 1 of $n = M/g$ tuples pass

rangeMinmaxNormalize ($R$, $A_{U_i}$, new interval $[a, b]$)

1. [range] Determine the minimum and maximum of the values of $A_{U_i}$

2. [minmax] Rescale the value of $A_{U_i}$ to fit into $[a, b]$



Again, the second step can be performed in a non-blocking manner with little 
resource consumption whereas the first step is blocking. 
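Both implementations can be sketched as follows. The sketch rests on two assumptions that are not confirmed by the text: the pick ratio is read as n = M/g, and "let 1 of n tuples pass" is modelled as taking every n-th tuple rather than a probabilistic choice. All names are illustrative.

```python
def count_pick_sample(relation, g):
    """Pick about g tuples: count first, then let every n-th tuple pass."""
    # Step 1 [count]: blocking, M is only known after a full pass.
    m = len(relation)
    n = max(1, m // g)  # assumed reading of the pick ratio: 1 of n = M/g
    # Step 2 [pick]: a streaming filter over the tuples.
    return [t for i, t in enumerate(relation) if i % n == 0][:g]


def range_minmax_normalize(values, a, b):
    """Linearly rescale values into the interval [a, b]."""
    # Step 1 [range]: blocking, needs the global minimum and maximum.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [a for _ in values]  # degenerate range: map everything to a
    # Step 2 [minmax]: streaming, one value at a time.
    return [a + (v - lo) / (hi - lo) * (b - a) for v in values]


sample = count_pick_sample(list(range(10)), 5)
scaled = range_minmax_normalize([10, 20, 30], 0.0, 1.0)
```

If the tuple count or the attribute range were known in advance, step 1 could be skipped in both functions and only the streaming step would remain.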

We may summarize our observations as follows. 

Dividing the Operator Implementations is Often Possible. Many of the 
complex operators seem to possess algorithms that process the data stream in 
two steps. The first step serves to prepare the data for the following steps, 
which will then process it in a fast, non-blocking manner with little resource 
consumption. 

Data Preparation Steps are Limited. There seem to be only a few standard
operators for data preparation steps. In the examples presented these were
ordering according to selected attributes (achieved by [sort]), grouping by selected
attributes (achieved by [collect]), counting tuples and determining the range of
values of an attribute. In principle, it is possible to treat those preprocessing
steps as separate operators. What is important is that the preparation level
attained by executing a single preprocessing step may be useful in more than
one subsequent processing step, provided it is not destroyed by an intervening
step. This opens the chance to avoid the preparation step in some cases.

Avoiding Data Preparation is the Key to Interaction. Blocking operators
slow down the response time and are therefore not suitable for interactive
data processing. Hence, our goal must be to find a way to substitute
non-blocking data preparation steps for the blocking ones, or to avoid the latter 
altogether. The idea is to keep track of the already achieved preparation levels 
in the pipeline. By cleverly combining and reusing these preparation levels we 
hope to avoid many costly preparation steps and go right on to process the tuples 
in the non-blocking calculation step. The next section introduces the necessary 
concepts. 



4 Data Stream Quality 

The previous section served to provide an intuitive understanding of the prob- 
lem. What we need now is a systematic approach to describe the state of data 
preparation. We introduce data stream quality as the central concept. The pre- 
vious section already indicated some useful data qualities: sortedness, continuity, 
and the knowledge of cardinality or extremal values. Below we formalize these 
qualities and present theorems that capture useful relationships between the 
different qualities. 

4.1 Definitions 

To describe continuity and sortedness, we assume equality (denoted by
$d_1 =_{T_U} d_2$) and (lexicographic) ordering on a pair of tuples (denoted by
$d_1 <_{T_U} d_2$) with regard to a list of attributes $T_U$, both with the standard
meaning. In consequence, we also use $\le_{T_U}$ in the usual manner.

The most important data stream quality is the lexicographic order of the 
tuples in a data stream with regard to a list of attributes, for example resulting 
from the SORT operator: 

Definition 1. A data stream $D = [d_1, \dots, d_M]$ has data stream quality sorted
ascendingly regarding a list $T_U$ of attributes, in characters $S^+([T_U])$, iff
$i < j \Rightarrow d_i \le_{T_U} d_j$ for all $i, j \in \{1, \dots, M\}$.

Decreasing order ($S^-$) can be defined analogously.

The COLLECT operator produces a data stream where all tuples having the
same $T_U$ values occur in a consecutive block, a property useful, e.g., for
aggregation:

Definition 2. A data stream $D = [d_1, \dots, d_M]$ has data stream quality
continuous regarding the list $T_U = [A_{U_1}, \dots, A_{U_n}]$ of attributes, written as
$C([T_U])$, iff for all $i, j \in \{1, \dots, M\}$, $d_i =_{T_U} d_j \Rightarrow d_i =_{T_U} d_k$ for all
$k$ with $i < k < j$.

As a special case of continuity, we have uniqueness. We call a data stream
distinct if no two tuples are equal with regard to $T_U$.

Definition 3. A data stream $D = [d_1, \dots, d_M]$ has data stream quality
distinct regarding the list $T_U = [A_{U_1}, \dots, A_{U_n}]$ of attributes, written as
$D([T_U])$, iff $d_i =_{T_U} d_j \Rightarrow i = j$ for all $i, j \in \{1, \dots, M\}$.
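The three sequence-related qualities can be checked on a finite stream with short predicates. The sketch assumes tuples are dictionaries and reads sortedness as a non-strict order, so that duplicates are permitted; the helper names are hypothetical.

```python
def key(t, attrs):
    """Project a tuple (a dict) onto the attribute list."""
    return tuple(t[a] for a in attrs)


def is_sorted(stream, attrs):
    """Sorted ascendingly: keys never decrease along the stream."""
    keys = [key(t, attrs) for t in stream]
    return all(x <= y for x, y in zip(keys, keys[1:]))


def is_continuous(stream, attrs):
    """Continuous: all tuples with equal keys form one consecutive block."""
    seen, prev = set(), object()  # object() is a sentinel equal to no key
    for t in stream:
        k = key(t, attrs)
        if k != prev:
            if k in seen:
                return False  # key reappears after a different key
            seen.add(k)
            prev = k
    return True


def is_distinct(stream, attrs):
    """Distinct: no two tuples share a key."""
    keys = [key(t, attrs) for t in stream]
    return len(set(keys)) == len(keys)


stream = [{"l": "B", "f": "x"}, {"l": "B", "f": "x"}, {"l": "A", "f": "y"}]
```

For the sample `stream`, the tuples are continuous regarding `["l"]` but neither sorted nor distinct, which illustrates that continuity is the weakest of the three properties.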

In contrast to the above data stream qualities, the following ones are not 
related to the sequence of the tuples in the stream, but describe the data stream 
as a whole. 

The data stream quality achieved by the COUNT operator is the knowledge of
the tuple count. It is independent of any list $T_U$ of attributes.

Definition 4. A data stream $D = [d_1, \dots, d_M]$ has data stream quality known
tuple count, in characters $num$, iff $M$ is known with the arrival of $d_1$.

Similar to the knowledge of the number of tuples, the a-priori knowledge of
extreme values increases the quality of a data stream. In contrast to the first
two definitions, it is defined on individual attributes.

Definition 5. A data stream $D = [d_1, \dots, d_M]$ has data stream quality known
attribute range regarding an attribute $A_{U_i}$, in characters $min(A_{U_i})$ or
$max(A_{U_i})$, iff $\min_{A_{U_i}}(D)$ or $\max_{A_{U_i}}(D)$ is known with the arrival of $d_1$.

4.2 Theorems 

The following theorems cover dependencies between the different data stream 
qualities. The proofs are straightforward. Tu denotes a list of attributes. 

Theorem 1. If a data stream is sorted regarding $T_U$, it is also sorted with regard
to every prefix of $T_U$.

Theorem 2. If a data stream is sorted regarding $T_U$, it is also continuous with
regard to $T_U$.

Theorem 3. If a data stream is distinct regarding $T_U$, it is also distinct with
regard to every permutation of $T_U$.

Theorem 4. If a data stream is continuous regarding $T_U$, it is also continuous
with regard to every permutation of $T_U$.

Theorem 5. If a data stream is distinct regarding $T_U$, it is also distinct in an
extension of this list by further attributes.

Theorem 6. If a data stream is distinct regarding $T_U$, it is also continuous in
$T_U$.

Theorem 7. If a data stream is sorted in increasing (decreasing) order, its
minimum (maximum) is also known with the arrival of the first tuple.

4.3 Data Stream Quality Requirements and Transformations

Given the data stream qualities, we now examine how different operator imple- 
mentations can exploit them. We concentrate - as an example - on the GROUP 
operator as a complex operator that is particularly important to KDD. For each 
implementation we present the input data stream quality that is at least nec- 
essary for the correct working of the implementation, and the change of data 
quality as a result of the transformation by the implementation. 

The GROUP operator has three implementations which differ in their minimally
required input data stream quality, their memory requirements and their
blocking behavior. The traditional implementation is hashGroup, which uses
a hash table to calculate the aggregation results for each combination of $T_U$ values.

    hashGroup ($T_U$, $f$)
        required input quality:  none
        output quality:          $D(T_U)$, $\neg S(*)$, $\neg num$



For this implementation no data stream quality is required on
input. Because each combination of $T_U$ values produces exactly one result tuple,
the output stream is distinct with regard to $T_U$. However, the number of tuples
is not known beforehand. Furthermore, the hash function usually destroys all
existing orders. The disadvantages of this implementation are the large amount
of memory necessary to simultaneously store all results, and the blocking output
(denoted by the bar near the end of the box), which restricts the use in a pipeline.

If the data stream quality at the input is at least continuous with respect to 
the list of the grouping attributes, the blockGroup implementation can be used. 



    blockGroup ($T_U$, $f$)
        required input quality:  $C(T_U)$
        output quality:          $\neg num$



It applies the aggregation function block by block and therefore does not need
to store more than one block of continuous tuples at the same time. Therefore,
the amount of required memory is usually far less than with the hashGroup
implementation. Moreover, if the input is sorted according to a list of attributes,
that property also holds for the output. The knowledge about minimum and
maximum values for attributes which are not included in $T_U$ is lost along with
these attributes. This implementation is delaying rather than blocking (denoted
by the short bar at the end of the box) because the first result tuple cannot be
output until the first block has been processed.

The noopGroup is used if the data stream is already distinct in $T_U$. Here, the
aggregation function can be applied tuple by tuple without blocking or delaying.
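The difference between the delaying and the fully streaming variant can be illustrated with Python generators. This is a sketch under assumptions (dictionaries as tuples, hypothetical function names); it relies on `itertools.groupby`, which groups only adjacent equal keys and therefore needs exactly the continuity property on its input.

```python
from itertools import groupby


def block_group(stream, attrs, agg):
    """Requires a continuous input: aggregate one block of equal keys at a time."""
    for k, block in groupby(stream, key=lambda t: tuple(t[a] for a in attrs)):
        # Only the current block is held in memory; the result is released
        # as soon as the block ends (delaying, not blocking).
        yield k, agg(list(block))


def noop_group(stream, attrs, agg):
    """Requires a distinct input: every tuple is its own group."""
    for t in stream:
        yield tuple(t[a] for a in attrs), agg([t])


total = lambda group: sum(t["v"] for t in group)
continuous = [{"n": "b", "v": 1}, {"n": "b", "v": 3}, {"n": "a", "v": 5}]
distinct = [{"n": "a", "v": 5}, {"n": "b", "v": 1}]
blocks_out = list(block_group(continuous, ["n"], total))
noop_out = list(noop_group(distinct, ["n"], total))
```

Note that `block_group` silently produces one result per run, so feeding it a non-continuous stream would yield duplicate keys; checking the input quality is the caller's responsibility in this sketch.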

As a result, the different implementations show that with rising input data
stream quality memory requirements decrease and pipelining capabilities
increase. Analysis of further operators shows that this principle holds for most of
the complex operators of the KDD process.



5 Providing Data Stream Qualities 

Data stream quality is a prerequisite for implementations that are
resource-economical and non-blocking. However, the prior steps to produce the qualities
seem to be blocking and need many resources. In this section we look for an
access method that can produce data streams with desired data stream qualities
right off the disk. We will see that there is a trade-off between the efficiency of
the index and the level of delivered data stream quality. Therefore, we introduce
the notion of pseudo-quality that allows us to control the desired degree of data
stream quality.

5.1 The Index Structure 

Traditional clustering indices or combinations of several one-dimensional indices 
support only one or very few attributes and in general allow sequential reading 
in just one dimension. Our needs are different, though. We have to deal with 
mass data and do not know in advance which attributes will be the subject of in- 
terest during query evaluation. Therefore, in order to achieve equal treatment of 
multiple attributes and allow for sequential disk access, we choose a multidimen- 
sional index structure based on the Peano (Z) space filling curve [13] for physical 
clustering of the multidimensional data. The Peano curve was chosen over other 
mappings that promise somewhat better clustering properties (such as the
Hilbert curve [6,8]) because of the efficient translation between operations in the 
one-dimensional Z-address space and operations in the multidimensional coordi- 
nate space by algorithms operating on the bit representation of the coordinates. 
The basic ideas are similar to the UB-tree [1] with its range query algorithm 
and TETRIS algorithm [12]: The space-filling curve is used to map the mul- 
tidimensional data tuples (represented as points in multidimensional space) to 
one-dimensional addresses reflecting the point’s position on this curve while pre- 
serving locality as far as possible. These addresses are then used to store the 
tuples using a conventional one-dimensional access method (such as in our case, 
a clustering B*-tree). The range-query algorithm for this data structure works 
as illustrated in Figure 2a: The segments of the space filling curves lying in the 
query box (shaded) are calculated and the data belonging to these segments can 
then be read sequentially from disk using the one-dimensional addresses of the 
entry/exit points of the curve with the query box. The small numbers in the 
figures denote the order in which these curve segments are processed. For this 
range query, we see that four segments are processed, which typically requires 
four random accesses. 
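The translation between coordinates and Z-addresses by bit manipulation can be sketched as follows. The bit layout chosen here (the first coordinate occupies the lowest bit of each group) is an assumption for illustration and need not match the layout of the index described above.

```python
def z_address(coords, bits):
    """Interleave the bits of all coordinates into one Z-order address."""
    addr = 0
    dims = len(coords)
    for bit in range(bits):
        for d, c in enumerate(coords):
            # Bit number `bit` of dimension d goes to position bit*dims + d.
            addr |= ((c >> bit) & 1) << (bit * dims + d)
    return addr


def z_decode(addr, dims, bits):
    """Inverse mapping: split a Z-order address back into coordinates."""
    coords = [0] * dims
    for bit in range(bits):
        for d in range(dims):
            coords[d] |= ((addr >> (bit * dims + d)) & 1) << bit
    return tuple(coords)


# Visiting a 2 x 2 grid in address order traces the characteristic Z shape.
grid = [(x, y) for y in range(2) for x in range(2)]
z_order = sorted(grid, key=lambda p: z_address(p, bits=1))
```

Because neighbouring addresses tend to belong to neighbouring points, storing tuples by `z_address` in a one-dimensional clustering index keeps most of a query box in a small number of contiguous address ranges.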

In our work, we extended the UB-tree algorithms by some optimizations for
sequential reading in a clustering index and by support for various data stream
qualities in arbitrary dimensions. To deliver the data satisfying a particular
range query according to a specified data stream quality, e.g. sorted, our
algorithm works as follows. To provide the sort order, the query box is segmented
into regions which are processed region by region in the desired sorting order.
For each region, the standard range query algorithm is used. Figure 2b shows
how the range query results of Figure 2a can be presented in sorted order along
dimension y (data stream quality $S^+([y])$). In order to deliver the data in this
order, the index splits the query box into flat regions of height 1 and processes
them one after another. As we can see by counting the corresponding line
segments, this approach is not very efficient: In our simplified example, 18 segments
are processed (and correspondingly 18 index accesses are performed).

Fig. 2. Range query, sorted, pseudo-sorted

The solution to this problem is shown in Figure 2c: By a controlled increase in
the block size of the segmentation, we are able to improve the index performance
at the expense of a controlled degradation of our data stream quality criteria.
The resulting data stream can be seen as a concatenation of blocks. Data stream
quality criteria (in our case sort order) are only valid between these blocks;
within these blocks, the values appear in arbitrary order. We call such degraded
data stream qualities pseudo-qualities. These qualities will be defined in the
following section. Note that the number of different values per block is limited
by the block size. In our example, we can see that by allowing the index to
deliver several (in our example 2) values of the sorting attribute per block, the
number of index accesses can be reduced to 6.

This approach can easily be generalized to higher dimensions. Instead of two- 
dimensional ranges, we create higher dimensional boxes which are processed one 
after another. 

In the case of continuity we take a similar approach. The difference is that 
the continuity property does not impose a special sort order. It simply requests 
that no value combination in the grouping attributes appears in more than one 
block, so there are more degrees of freedom to the shape of those blocks in 
multidimensional space. This is exploited by our index to reduce the number of 
entry/exit points with the space filling curve, rendering the index more efficient, 
especially for higher dimensions. 

5.2 Pseudo Qualities 

In the last section we have seen what data stream qualities look like if we extend 
our UB-index to tolerate slight degradations from our strict quality criteria for 
the sake of efficiency. We include these in our formal framework by adding the 
following definitions. 

Definition 6. [Pseudo-Sorting]

A data stream $D_R$ has the data stream quality pseudo-sorted$_k$ ascendingly
with respect to $T_U$ and $k$, in short $PS_k([A_{U_1}, \dots, A_{U_n}])$, iff there
exists a partitioning of $D_R$ into consecutive streams $D_R = D_1 \circ \dots \circ D_b$,
with

1. for all $g \in \{1, \dots, b\}$: $|\{\Pi_{T_U}(d_i) \mid d_i \in D_g\}| \le k$

2. for all $g, h \in \{1, \dots, b\}$ with $g < h$ and for all $d \in D_g$ and $e \in D_h$:
   $d <_{T_U} e$

In other words, a data stream is pseudo-sorted if it can be partitioned into 
consecutive blocks such that each block has at most k distinct values of the 
sort key (in arbitrary order) and the key ranges of successive blocks form a 
non-overlapping ascending sequence of ranges. 
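For a stream whose block partitioning is already given (for example by block markers), the two conditions can be checked directly. The sketch below assumes blocks are passed as a list of lists of dictionaries and that "at most k distinct values" is the intended bound; the function name is hypothetical. It does not search for a partitioning, it only validates a given one.

```python
def is_pseudo_sorted(blocks, attrs, k):
    """Check the pseudo-sorting conditions for a given block partitioning."""
    prev_max = None
    for block in blocks:
        keys = {tuple(t[a] for a in attrs) for t in block}
        if len(keys) > k:
            return False  # condition 1: at most k distinct keys per block
        if not keys:
            continue  # an empty block constrains nothing
        if prev_max is not None and min(keys) <= prev_max:
            return False  # condition 2: key ranges must ascend strictly
        prev_max = max(keys)
    return True


blocks = [[{"y": 2}, {"y": 1}, {"y": 2}], [{"y": 4}, {"y": 3}]]
overlapping = [[{"y": 1}, {"y": 3}], [{"y": 2}]]
```

Comparing each block only with the largest key seen so far is sufficient, since the strict order between successive blocks carries over to all earlier blocks by transitivity.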

Similar to the above definition, we define pseudo-continuity. Here it suffices 
to require that each value combination appears in exactly one block: 

Definition 7. [Pseudo-Continuity]

A data stream $D_R$ has the data stream quality pseudo-continuous$_k$
with respect to attribute list $T_U$ and $k$, in short $PC_k([A_{U_1}, \dots, A_{U_n}])$, iff
there exists a partitioning of $D_R$ into consecutive streams $D_R = D_1 \circ
\dots \circ D_b$, with

1. for all $g \in \{1, \dots, b\}$: $|\{\Pi_{T_U}(d_i) \mid d_i \in D_g\}| \le k$

2. for all $g, h \in \{1, \dots, b\}$ with $g \ne h$ and for all $d \in D_g$ and $e \in D_h$:
   $d \ne_{T_U} e$

With these definitions we can now formulate the following theorems. $T_U$
denotes a list of attributes. Theorems 8 to 10 show that sorting is a special case
of pseudo-sorting ($k = 1$).

Theorem 8. If a data stream is sorted with respect to $T_U$, then it is also
pseudo-sorted$_k$ for every $k$.

Theorem 9. A data stream is sorted with respect to $T_U$ iff it is pseudo-sorted$_1$
with respect to $T_U$.

Theorem 10. If a data stream is pseudo-continuous$_k$ (pseudo-sorted$_k$) with
respect to $T_U$, then it is also pseudo-continuous$_{k'}$ (pseudo-sorted$_{k'}$) for every
multiple $k'$ of $k$.

Theorem 11. If a data stream is pseudo-sorted$_k$ with respect to $T_U$, then it is
also pseudo-continuous$_k$.

Theorem 12. If a data stream is continuous with respect to $T_U$, then it is
pseudo-continuous$_k$ for every $k$.

Theorem 13. A data stream is continuous with respect to $T_U$ iff it is
pseudo-continuous$_1$ with respect to $T_U$.

Theorem 14. If a data stream is pseudo-continuous$_k$ with respect to $T_U$, then
it is pseudo-continuous$_k$ with respect to every permutation of $T_U$.

5.3 Operators Dealing with Pseudo-Qualities



Since most of the implementations for the calculation steps cannot make direct 
use of the pseudo-qualities described in the previous section, we still need prepa- 
ration operators in order to transform pseudo data qualities into strong data 
qualities. These operators work on each block in the data stream (the index in- 
serts markers into the data stream to delimit individual blocks) and release the 
result with the desired quality to the next operator. Our goal is to replace the 
blocking operators SORT and COLLECT by non-blocking operators that transform 
pseudo-sorted and pseudo-continuous data streams to sorted and continuous 
data streams, respectively. 

For the first case, we introduce the k-Sort operator. This operator is a
specialization of SORT for pseudo-sorted inputs, and respects the block boundaries:
After sorting an entire block with a maximum of $k$ different values for the sort
key, the operator knows that all tuples in the following blocks are greater with
respect to the sort attributes, and therefore it can release the result and clear its
internal data structures. In our notation, the k-Sort operator can be characterized
as shown below. Note that in general, all data stream qualities are lost with the
exception of those that refer to a prefix or an extension of the sort attribute list $T_U$.



    k-Sort ($T_U$)
        required input quality:  $PS_k(T_U)$
        output quality:          $S(T_U)$, $\neg S(X)$, $\neg C(Y)$

$X$ refers to all attribute sequences that are neither a prefix of $T_U$ nor an
extension of $T_U$. $Y$ denotes all attribute lists that are neither a prefix nor a
superset of $T_U$.
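A k-Sort operator can be sketched as a generator that buffers one block at a time. The marker object `BLOCK_END` and the dictionary representation of tuples are assumptions made for this illustration; the paper only states that the index inserts markers to delimit blocks.

```python
BLOCK_END = object()  # hypothetical block marker inserted by the index


def k_sort(stream, attrs):
    """Turn a pseudo-sorted stream with block markers into a sorted stream."""
    sort_key = lambda t: tuple(t[a] for a in attrs)
    buffer = []
    for item in stream:
        if item is BLOCK_END:
            # All later tuples are greater, so this block can be released.
            buffer.sort(key=sort_key)
            yield from buffer
            buffer = []
        else:
            buffer.append(item)
    # Flush tuples that follow the last marker, if any.
    buffer.sort(key=sort_key)
    yield from buffer


stream = [{"y": 2}, {"y": 1}, BLOCK_END, {"y": 4}, {"y": 3}, BLOCK_END]
sorted_ys = [t["y"] for t in k_sort(stream, ["y"])]
```

Memory use is bounded by the largest block rather than by the whole input, and the first results appear after the first marker instead of after the last tuple.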

Similarly, k-Collect transforms pseudo-continuous data streams into
continuous streams. After collecting the values of an entire block (e.g., in a hash
table) and reading a block boundary, it can be certain that none of the value
combinations in $T_U$ seen in this block will ever reappear in subsequent blocks.
Hence, it can release the continuous result and clear its internal data structures.
The operator is shown below in our notation. Here in general, all data stream
qualities involving attribute lists different from $T_U$ are lost.

    k-Collect ($T_U$)
        required input quality:  $PC_k(T_U)$
        output quality:          $C(T_U)$, $\neg S(*)$, $\neg C(X)$

$X$ refers to all attribute lists that contain only a subset of the attributes from
$T_U$.
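The same pattern gives a sketch of k-Collect, here with a per-block hash table. As before, `BLOCK_END` and the dictionary representation of tuples are illustrative assumptions rather than the paper's actual interface.

```python
BLOCK_END = object()  # hypothetical block marker inserted by the index


def k_collect(stream, attrs):
    """Turn a pseudo-continuous stream with block markers into a continuous one."""
    table = {}
    for item in stream:
        if item is BLOCK_END:
            # No key of this block can reappear later: release and reset.
            for members in table.values():
                yield from members
            table = {}
        else:
            table.setdefault(tuple(item[a] for a in attrs), []).append(item)
    # Flush tuples that follow the last marker, if any.
    for members in table.values():
        yield from members


stream = [
    {"n": "a", "v": 1}, {"n": "b", "v": 1}, {"n": "a", "v": 2}, BLOCK_END,
    {"n": "c", "v": 1}, {"n": "d", "v": 1}, {"n": "c", "v": 2}, BLOCK_END,
]
collected = [(t["n"], t["v"]) for t in k_collect(stream, ["n"])]
```

The output order within a block follows the insertion order of the hash table, so any sort order the input may have had on other attributes is not guaranteed to survive.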

With our specialized index and the operators described above, it is now pos- 
sible to provide all desired data stream qualities in a non-blocking and resource- 
efficient way. 



6 Parallel and Interactive Preprocessing 

6.1 Basic Example 

We now demonstrate the positive effects of our approach on query execution.
Consider the following simple query that computes from a table of account
movements the average amount each customer (denoted by FirstName and LastName)
was credited in each transaction.

SELECT FirstName, LastName, AVG(sum) as meanCash
FROM Accounting
GROUP BY FirstName, LastName
ORDER BY LastName, meanCash

The resulting operator tree is displayed in Figure 3. To be able to use an
efficient blockGroup implementation for the GROUP BY that processes blocks
of continuous values, the data stream should be continuous with respect to
[FirstName, LastName] (in short, C([F,L])). For the Sort operator, the data
stream should either be sorted by LastName (then we would use the blockSort
implementation that sorts blocks with the same value in its first attribute), or
pseudo-sorted (then we would use the new k-Sort implementation). In our case,
it turns out that PS([L,F]) is the optimum configuration for the index access:
Because of Theorem 11 we know that this also means PC([L,F]) and, by
Theorem 14, PC([F,L]), which can be used by the k-Collect operator to provide
continuity to the blockGroup implementation. The output of this operator is
reduced in cardinality, but remains pseudo-sorted in [L,F]. It can therefore be
sorted by the k-Sort operator to provide strict sorting in LastName to the
blockSort operator, which now produces the desired order.




Fig. 3. Pipeline for Example Query 

This simple example shows the power of our framework: the index structure
can efficiently provide initial data stream qualities; the k-implementations
of data preparation operators can be non-blocking, because the block markers
in the pseudo-ordered data stream allow them to release their current results and
proceed with the next block; resource consumption can be effectively controlled
by varying the block size k; data preparation operators can be inserted where
they are really needed (e.g., placed after initial selections); and the delivered
data stream qualities (that formerly existed only within individual
implementations of complex operators) are made explicit and can be shared by several
operators.

6.2 Experiments 

We have implemented our extended algebra with data stream qualities in
JavaParty [14] on our parallel platform. In our first experiments with complex queries
on the 1 GB TPC-D data set, a pipelined implementation of the algebra could
indeed reduce response times by a factor of up to 20. Compared with pure in-memory
execution without data stream qualities, the improvements in response time entail a
penalty of less than 30% in overall execution time. For datasets that do not fit
into main memory (which is the more realistic assumption, especially in
multiuser environments) conventional execution has to swap intermediate results to
disk. In this case there is barely any penalty; instead, we even get better
overall execution times because of the possibility of effective control over resource
consumption. These results demonstrate that we achieve interactive query
processing and at the same time enhance scalability and efficiency.



7 Parallel and Interactive Data Mining 

After having gained some insight into preprocessing, we now look at how this 
approach can be combined with data mining functionality to achieve parallel 
and interactive data mining. As an example we look at the problem of finding 
distance-based outliers. 

7.1 Distance-Based Outliers 

We define distance-based outliers in accordance with [10]. A data tuple is called 
a (p,D)-outlier if at least fraction p of the tuples in the dataset are beyond 
a distance of D from the observed tuple. A simple operator to calculate these 
outliers would be the nested-loop operator: For every tuple, the distances from 
this tuple to all the others are calculated and those lying within distance D are 
counted. If we get over the threshold defined by p, the tuple is no outlier and the 
next tuple is analyzed. Clearly, this algorithm is not efficient and only usable if 
the whole data fits in memory because index accesses to find neighboring tuples 
would make things only worse. In [10] the problem is alleviated by several block- 
and cell based approaches aiming to reduce disk accesses and complexity. 

Our data stream qualities allow us to take a different approach: If the data
stream has the quality sorted according to some attribute, we are able to
exploit this quality to increase interactivity and save resources. The corresponding
windowOutlier operator works as shown in Figure 4. On the data stream ordered
by attribute $A_{U_i}$, we define a window of size $D$, which is shifted over this stream
as data flows through the operator. As a tuple enters the window (e.g., tuple 7)
it is compared to all others within the window. Each time the distance between
two tuples is lower than $D$, each tuple's counter is increased. As a tuple leaves
this window (as in this case, tuple 3), it is marked as an outlier if its counter is
below the threshold.

Fig. 4. Pipelining with the windowOutlier operator
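The nested-loop and the windowed variant can be compared in a small sketch. Several points are assumptions made for illustration: tuples are numeric Python tuples, the distance is Euclidean, a tuple counts as a neighbour if its distance is at most D, a tuple is not its own neighbour, and the threshold is expressed as "at least p * N tuples lie beyond D". The function names are hypothetical.

```python
from collections import deque
from math import dist


def nested_loop_outliers(points, p, d):
    """Reference variant: compare every tuple with every other tuple."""
    n = len(points)
    result = []
    for i, x in enumerate(points):
        near = sum(1 for j, y in enumerate(points) if i != j and dist(x, y) <= d)
        if n - 1 - near >= p * n:
            result.append(x)
    return result


def window_outliers(sorted_points, n, p, d, axis=0):
    """Streaming variant: input sorted on `axis`, tuple count n known up front."""
    window = deque()  # entries are [point, neighbour_count]

    def is_outlier(entry):
        return n - 1 - entry[1] >= p * n

    for x in sorted_points:
        # Tuples more than d behind x on the sort axis can gain no more neighbours.
        while window and x[axis] - window[0][0][axis] > d:
            entry = window.popleft()
            if is_outlier(entry):
                yield entry[0]
        new = [x, 0]
        for entry in window:
            if dist(x, entry[0]) <= d:
                entry[1] += 1
                new[1] += 1
        window.append(new)
    for entry in window:  # flush the last window at end of stream
        if is_outlier(entry):
            yield entry[0]


pts = sorted([(0, 0), (1, 0), (2, 1), (50, 50), (1, 1)])
windowed = list(window_outliers(pts, len(pts), 0.5, 5))
```

The window is valid because two tuples that differ by more than D in a single attribute are necessarily more than D apart in Euclidean distance, so only tuples inside the window can be neighbours.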

We see that this implementation is non-blocking and resource-efficient because
only the tuples within $D$ need to be stored in memory and compared at a time.
Therefore we can add the following implementations for finding outliers with
respect to a set $T_U$ of attributes to our set of algorithms:

    standardOutlier ($T_U$)
        required input quality:  none

    windowOutlier ($T_U$)
        required input quality:  $num$, $S(A_{U_i})$
        output quality:          $S(A_{U_i})$

The standardOutlier implementation needs no specific data stream quality
but is blocking. The windowOutlier implementation needs the data stream
quality sorted on some attribute $A_{U_i}$ with $A_{U_i} \in T_U$ and the knowledge of
the number of incoming tuples in order to deliver outliers in a non-blocking way.
Note that this implementation preserves the quality sorted in the resulting data
stream.



7.2 Evaluation 

We implemented both variants of the operator in our platform and generated
a dataset of 10000 tuples to measure the effects. The data consists of seven
attributes of the following types, cardinalities and distributions:

  Attribute   Type, cardinality and distribution
  key         Integer
  int1        0-99, evenly distributed, 2 tuples at 200, 300, 400
  int2        0-9, evenly distributed, 3 tuples at 20, 30, 40
  float1      0.08 - 9999.75, evenly distributed
  float2      -36.58 - 44.52, Gaussian distribution
  String1     1-5 characters, evenly distributed
  String2     3 characters, names of the German federal states



Fig. 5. Execution Times for standard and windowed Outlier operator 

Figure 5 displays the results comparing the standardOutlier
implementation and the windowOutlier implementation. For both implementations we
tested the following attribute combinations:

- the int1 attribute (1 Attr),

- the int1 attribute normalized (1 Attr Norm),

- the int1, int2, float1 attribute set (3 Attr. Norm) and

- the int1, int2, float1, float2, String1, String2 attribute set
  (6 Attr. Norm)

with a percentage p of 0.9 and a distance D of 70. The bars show the response
times (i.e., the time until the first result is produced) and the remaining overall
execution time (which is equal to the response time for the standardOutlier
implementation, of course). We used int1 as the sorting attribute (i.e. the
dimension in which the window is moved) in the windowOutlier implementation.
We see that the windowOutlier implementation shows substantial improvements
in response time. In most cases, the first results are presented almost
immediately. The total execution time is slightly better as well for the windowOutlier
implementation. That means that the additional work for sorting the input
by means of our index structure and an additional sort operator is more than
compensated for by the reduced number of comparisons of the windowOutlier
operator.

In order to further investigate the effect of the window size on overall
execution time for the windowOutlier implementation, we measured the execution
times for different window sizes D. Figure 6 shows the execution times for the
normalized attribute sets used above. We see that the implementation can use
the smaller window sizes to significantly reduce execution time. For a window
size of 10, execution time can be reduced by up to 80% compared to the
maximum execution times occurring for window sizes of 100 and greater. (Because
we again used int1 as our sorting attribute, window sizes greater than 100 have
no further effect because nearly all tuples fit into one window.) For the
standardOutlier implementation (not displayed here), execution times stay
nearly constant for all window sizes. The average time is about 10% higher than
the displayed maximum execution time of the windowOutlier variant. The
response times in this experiment stayed under 6% of the overall execution time
throughout all window sizes.

Fig. 6. Different window sizes for the windowOutlier operator

Because our multidimensional index structure allows the use of every index
attribute for sorting, the third experiment shows the influence of the sorting
attribute on execution times. We compare the response time and overall execution
time for the detection of six-attribute outliers with each of the six attributes
chosen as sort attribute. The results (see Figure 7) are quite interesting: The
results for the two integer attributes are nearly identical (because of their
nearly identical normalized cardinality distribution), the float1 attribute
performs much better, and the float2 attribute lies between the float1 result
and the integer results. The reason is that, in contrast to the int1 attribute,
the float1 attribute is evenly distributed throughout the normalized attribute
range, so there are fewer tuples in each window. The float2 attribute has
similar properties, but its Gaussian distribution leads to a degradation in execution
time. The explanation for the high response time for the String attributes is
that the relatively large window size (again 70) combined with the chosen
mapping of Strings into the normalized domain places nearly all the tuples within
one window, so the windowOutlier degenerates to a standardOutlier. This
result shows the utmost importance of the choice of the right sort attribute,
especially for high distance values D. It is the task of the query optimizer to
choose an attribute with sufficiently high cardinality and a well-balanced
distribution within the attribute range in order to render the implementation
interactive and efficient. The necessary information can be obtained from optimizer
statistics.

These experiments demonstrate that the concept of using data stream
qualities to improve interactivity and efficiency holds promise for a seamless extension
from preprocessing to data mining tasks. The multidimensional index structure
allows several sort orders to be delivered efficiently. The optimizer can then choose
the optimum data stream qualities for the operators. Interactive and
resource-efficient processing of whole streams of preprocessing and data mining operators
thus becomes achievable.

[Figure 7 axis: Sorting Attribute (Int1, Int2, Float1, Float2, String1, String2)]
Fig. 7. The influence of the sorting attribute

7.3 Enhancing Scalability 

The experiments demonstrated that the Outlier Operator now is interactive, 
but still rather expensive. In order to further enhance scalability, the approach 
described above can be improved by cascading several windowOutlier imple- 
mentations into a pipeline: By using the fact that 

$O(p, D') \supseteq O(p, D)$ for $D' < D$

indicating that an outlier with respect to a given distance D is always an outlier 
for each smaller distance D’ as well, the pipeline starts with small window sizes 
(and therefore fewer comparisons) marking all potential outliers for this window 
size. Only these potential outliers have to be tested in the subsequent steps on 
whether their outlier property still holds for larger window sizes as well. In effect, 
the early stages in this pipeline work with higher throughput because of the small 
window sizes, and the later stages are more efficient as well because they have 
fewer potential outliers to observe. In this way we are able to distribute the work 
across several implementations and, by balancing the load within this pipeline, 
increase overall pipeline throughput. 

8 Conclusion 

We set out to improve efficiency, scalability and interactivity in KDD. The first 
idea was to isolate the blocking, resource-intensive operators. We identified the 
data preparation operators as the ones that tend to interrupt the continuity 
of a data stream. By introducing the notion of data stream quality and by 
employing advanced index structures that support an algebra of data-stream-
quality-aware operators we were able to demonstrate substantial improvements 
in sequential and parallel query execution. The results gained with the Outlier 
operator demonstrate that this not only holds for preprocessing but can be 
extended to parallel and interactive data mining as well. 

These results together with the possibility to control resource consumption 
at optimization time seem to open a wide area of applications for our techniques 
beyond KDD. Such applications may range from the integration of analysis func- 
tionality into mainstream commercial databases all the way across embedded 
databases, up to pipelined distributed query processing in Grid Computing. 
Still, we feel we just scratched the surface. We plan to extend our work towards 
an extensive investigation of the combination of our index structure with vari- 
ous data stream quality levels. In addition we plan to further enhance scalability 
making use of replicated input data by means of our index structure. Only then 
do we feel confident enough to tackle the issue of interleaving more complex and 
multi-stage data mining algorithms with our improved algebra. 
Abstract. Algorithms for finding frequent itemsets fall into two broad
categories: algorithms that are based on non-trivial SQL statements 
to query and update a database, and algorithms that employ sophis- 
ticated in-memory data structures, where the data is stored in flat files. 
Most performance experiments have shown that SQL-based approaches 
are inferior to main-memory algorithms. However, the current trend of 
database vendors to integrate analysis functionalities into their query 
execution and optimization components, i.e., “closer to the data,” suggests
revisiting these results and searching for new, potentially better
solutions.

We investigate approaches based on SQL-92 and present a new ap- 
proach called Quiver that employs universal and existential quantifica- 
tions. In the table schema for itemsets of our approach, a group of tuples 
represents a single itemset. Such a “vertical” layout is similar to the 
popular layout used for the transaction table, which is the input of fre- 
quent itemset discovery. We show that current DBMS do not provide 
efficient query processing strategies for dealing with quantified queries, 
mostly due to the lack of an adequate SQL syntax for set containment 
tests. Performance tests using a query processor prototype and a novel 
query operator, called set containment division, promise an improved 
performance for quantified queries like those used for Quiver. 



1 Introduction 

The discovery of frequent itemsets is a computationally expensive preprocess- 
ing step for association rule discovery [1], which finds rules in large transac- 
tional data sets. Frequent itemsets are combinations of items that appear fre- 
quently together in a given set of transactions. Association rules characterize, 
for example, the purchase pattern of retail customers or the click pattern of 
web site visitors. Such information can be used to improve marketing cam- 
paigns, retailer store layouts, or the design of a web site’s contents and hyperlink 
structure. 

Most commercial data mining systems and research prototypes employ algo- 
rithms that run on data stored in flat files. However, database vendors begin to 
act against the general lack of mining functionality to support business
intelligence and integrate new “primitives” into their systems [2]. Companies
employing data mining tools for their business [3] realize the need for integrating data
mining algorithms with DBMS. 

Some authors in research as well as in industry have suggested special 
data mining languages like MSQL [4], DMQL [5] or a variant of DMQL [6], 
and ATLaS [7]. Others have enriched query languages with mining function- 
ality, like the MINE RULE operator [8], OLE DB for DM [9]. However, the 
query processing power offered by modern database systems for mining pur- 
poses has been widely neglected in the past. Some research results show that al- 
gorithms for frequent itemset discovery based on SQL are less efficient than those 
based on sophisticated in-memory data structures [10]. Others claim that “even 
SQL as is is adequate even for complex data mining queries and algorithms” [11]. 
Nevertheless, it becomes ever more important for database system vendors 
to offer novel analytic functionalities to support business intelligence appli- 
cations. 

In this paper, we analyze several approaches to compute frequent itemsets 
using SQL. We also propose a new SQL-based approach and compare it to the 
other approaches. 

1.1 The Problem of Frequent Itemset Discovery 

We briefly introduce the widely established terminology relevant for frequent 
itemset discovery. An item is an object of analytic interest, like a product of a 
shop or a URL of a document on a web site. An itemset is a set of items and a 
k-itemset contains k items. A transaction is an itemset representing a fact, like 
a purchase of products or a collection of documents requested by a user during 
a web site visit. 

Given a set of transactions, the frequent itemset discovery problem is to find 
itemsets within the transactions that appear at least as frequently as a given 
threshold, called minimum support. For example, a user can define that an item- 
set is frequent if it appears in at least 2% of all transactions. 

Almost all itemset discovery algorithms consist of a sequence of steps that
proceed in a bottom-up manner. The result of the $k$th step is the set of frequent
$k$-itemsets, denoted as $F_k$. The first step computes the set of frequent items
(1-itemsets). Each following step consists of two phases: First, the candidate
generation phase computes a set of potential frequent $k$-itemsets from $F_{k-1}$.
The new set is called $C_k$, the set of candidate $k$-itemsets. It is a superset of $F_k$.
Second, the support counting phase selects those itemsets from $C_k$ that appear
at least as frequently in the given set of transactions as the minimum support and
stores them in $F_k$.
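
The level-wise scheme described above can be sketched in a few lines of Python. This is only an in-memory illustration under simplifying assumptions: transactions are Python sets, the minimum support is given as a fraction, and the function name frequent_itemsets is hypothetical. It is not one of the SQL-based algorithms discussed in this paper.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise discovery; returns a dict mapping k to the set F_k.

    transactions: list of sets of items.
    min_support: minimum fraction of transactions that must contain an itemset.
    """
    min_count = min_support * len(transactions)

    def support_count(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    # Step 1: frequent items (1-itemsets).
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items
             if support_count(frozenset([i])) >= min_count}

    result = {}
    k = 1
    while level:
        result[k] = level
        k += 1
        # Candidate generation: combine two frequent (k-1)-itemsets into a
        # k-itemset and keep it only if all its (k-1)-subsets are frequent.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        # Support counting: keep candidates that reach the minimum support.
        level = {c for c in candidates if support_count(c) >= min_count}
    return result
```

The loop ends as soon as a step produces no frequent itemsets, which mirrors the bottom-up sequence of steps.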

All known SQL-based algorithms follow this “classical” two-phase approach. 
There are other, non-SQL-based approaches, like frequent-pattern growth, which 
do not require a candidate generation phase [12]. The frequent-pattern growth al- 
gorithm, however, employs a (relatively complex) main-memory data structure, 
called frequent-pattern tree, which disqualifies it for a straightforward compari- 
son with SQL-based algorithms. 
1.2 Motivation for a New SQL-Based Approach

The key problem in frequent itemset discovery is: “How many transactions contain
a certain given itemset?” This question can be answered in relational algebra
using the division operator (÷), introduced in [13]. Suppose that we have a relation
Transaction(transaction, item) containing a set of transactions and a relation
Itemset(item) containing a single itemset, i.e., each tuple contains one item.
We want to collect those transaction values in a relation Contains(transaction),
where for all tuples in Itemset there is a corresponding tuple in Transaction that
has a matching item value together with that transaction. In relational algebra,
this problem can be stated as Transaction(transaction, item) ÷ Itemset(item) =
Contains(transaction).

The example in Figure 1 illustrates the division operation. The Transaction 
table consists of three transactions and two of them contain all items of Itemset. 
We simply have to count the number of tuples in Contains to decide if the 
itemset is frequent. Using division terminology, Transaction plays the role of the
dividend, Itemset represents the divisor, and Contains is the quotient. 



Fig. 1. Example showing the relationship between frequent itemset discovery
and relational division: Transaction ÷ Itemset = Contains
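
To make the division concrete, the following sketch runs it on an in-memory SQLite database using a doubly nested NOT EXISTS formulation. The sample rows are hypothetical (three transactions, two of which contain the itemset), and the table and column names Trans and tid are assumptions chosen because TRANSACTION is a keyword in SQLite.

```python
import sqlite3

# Hypothetical sample data; Trans/tid are assumed names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Trans (tid INTEGER, item TEXT);
    CREATE TABLE Itemset (item TEXT);
    INSERT INTO Trans VALUES
        (1001, 'diapers'), (1001, 'beer'), (1001, 'chips'),
        (1002, 'chips'), (1002, 'diapers'),
        (1003, 'beer'), (1003, 'chips');
    INSERT INTO Itemset VALUES ('chips'), ('diapers');
""")

# Division as "there is no item of the itemset that the transaction lacks".
DIVISION = """
    SELECT DISTINCT t1.tid
    FROM Trans AS t1
    WHERE NOT EXISTS (
        SELECT * FROM Itemset AS i
        WHERE NOT EXISTS (
            SELECT * FROM Trans AS t2
            WHERE t2.tid = t1.tid AND t2.item = i.item))
    ORDER BY t1.tid
"""

contains = [row[0] for row in conn.execute(DIVISION)]  # the quotient
support = len(contains)  # number of transactions containing the itemset
```

Counting the rows of the quotient gives the support of the itemset, which is then compared with the minimum support.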



Unfortunately, frequent itemset discovery poses the additional problem that 
we have to check many (candidate) itemsets for sufficient frequency, i.e., unlike 
in Figure 1, we usually do not have a constant divisor relation but we need many 
divisor relations. 

We are not aware of any commercial database system that has implemented 
a division operator. One reason for this is that division is not a basic operator, 
i.e., it can be expressed by other operators of the relational algebra. Another 
reason is that it is difficult for a system to detect a division problem inside a 
SQL query because the language offers no dedicated syntax for division. Division 
problems within queries are often simulated either by indirect formulations using 
counting or by using a “NOT EXISTS” clause. Nevertheless, in theory there 
are several efficient algorithms for division [14,15]. For example, suppose that 
the divisor Itemset is sorted on item, and the dividend Transaction is grouped 
on transaction (as in Figure 1) and for each group, it is sorted on item in the
same order as the divisor. Then we can employ an algorithm similar to sort-merge
join that computes the quotient in one scan over the dividend and divisor
tables. 
The input data of the division may have such a data property for one of two
reasons. First, the data was materialized in this way in tables on disk (clustered 
index). Second, the data is an intermediate result of another operator within a 
larger query. For example, a sort-merge join could have produced the dividend 
from two other input tables. 
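
A merge-style division over input with this property might look as follows. This is a minimal sketch, not one of the algorithms from [14,15]: the function name merge_division is hypothetical, the dividend is assumed to be a list of (transaction, item) pairs grouped on transaction and sorted on item within each group, and the divisor is assumed to be a duplicate-free list of items in the same sort order. The divisor position is simply reset for each group.

```python
def merge_division(dividend, divisor):
    """Return the transactions whose group contains every divisor item.

    dividend: (transaction, item) pairs, grouped on transaction and sorted
              on item within each group.
    divisor:  distinct items, sorted in the same order.
    """
    quotient = []
    needed = len(divisor)
    current = None  # transaction ID of the group being scanned
    matched = 0     # number of divisor items matched in the current group

    for transaction, item in dividend:
        if transaction != current:
            # A new group starts: emit the previous one if it matched fully.
            if current is not None and matched == needed:
                quotient.append(current)
            current, matched = transaction, 0
        # Because both inputs are sorted, the next divisor item can only
        # match at or after the current dividend position.
        if matched < needed and item == divisor[matched]:
            matched += 1

    if current is not None and matched == needed:
        quotient.append(current)
    return quotient
```

Each dividend tuple is inspected exactly once, which is the property that makes such an algorithm attractive when the sort order comes for free from a clustered index or a preceding operator.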

Based on the idea of using division to specify the itemset containment prob- 
lem, we devised a complete algorithm in SQL using a vertical table layout and 
universal quantifications. We use the term “universal quantification” instead of 
“division” because we will first specify the queries of our approach using tuple 
relational calculus, which includes the mathematical universal quantifier (∀). In
addition, we will show SQL queries that are equivalent to the calculus expres- 
sions. 

Although there are two main approaches to express universal quantification
in SQL, as mentioned before, one based on counting, the other based on value
comparisons, we focused on the latter approach. The reason is that the counting
approach imposes extra restrictions on the quantification problem and it is less
intuitive. To see why, suppose we want to test if an itemset i is contained in a
transaction t. The counting approach compares the number of items in i with
those in t. This comparison makes sense only if we require that t contains only
items that are contained in i. In other words, we have to remove those items
from t in a preprocessing step that are not contained in i. After that, we count
the number of items in i and t and find that i ⊆ t if the numbers are equal. The
SQL statements based on value comparisons that will be explained later in this
paper are a more intuitive formulation of the quantification problem.
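
For comparison, the counting approach can be sketched as follows on an in-memory SQLite database. The data and the names Trans and tid are hypothetical, and the query assumes that neither table contains duplicate rows. The join plays the role of the preprocessing step that removes items of t that are not in i.

```python
import sqlite3

# Hypothetical sample data; Trans/tid are assumed names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Trans (tid INTEGER, item TEXT);
    CREATE TABLE Itemset (item TEXT);
    INSERT INTO Trans VALUES
        (1001, 'diapers'), (1001, 'beer'), (1001, 'chips'),
        (1002, 'chips'), (1002, 'diapers'),
        (1003, 'beer'), (1003, 'chips');
    INSERT INTO Itemset VALUES ('chips'), ('diapers');
""")

# The join keeps only transaction items that also occur in the itemset;
# a transaction qualifies if as many items remain as the itemset has.
COUNTING = """
    SELECT t.tid
    FROM Trans AS t JOIN Itemset AS i ON t.item = i.item
    GROUP BY t.tid
    HAVING COUNT(*) = (SELECT COUNT(*) FROM Itemset)
    ORDER BY t.tid
"""

counting_result = [row[0] for row in conn.execute(COUNTING)]
```

The formulation is compact, but its correctness depends on the duplicate-free assumption, which is one of the extra restrictions mentioned above.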

Note that the division operator is closely related to set containment joins, 
which can be implemented very efficiently, as shown for example in [16]. We 
currently investigate the relationship between these two classes of algorithms 
in detail as ongoing work [17]. A set containment join ($\bowtie_{\subseteq}$) is a join between
set-valued attributes of two relations R and S: $R \bowtie_{\subseteq} S = \{(s, t) \mid s \in R \wedge t \in S \wedge s \subseteq t\}$.
Hence, the items of each transaction or itemset would be represented by a set. 
The focus of this paper lies on algorithms based on SQL-92, which does not cover 
set-valued attributes. We have recently proposed a generalization of division, 
called set containment division, where the divisor is defined by a group of tuples. 
For example, the divisor relation Itemset(itemset, item) defines a divisor for
each value of itemset. The result is the union of division operations between the 
dividend relation and a single group of the divisor relation [14]. This operator 
is equivalent to set containment join but processes relations in the first normal 
form. 
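
The idea of dividing by many divisor groups at once can be sketched in Python as follows. This is an illustration of the semantics only, not the query operator itself: the function name is hypothetical, and the assumption that the quotient carries the itemset value alongside the transaction value is ours.

```python
from collections import defaultdict


def set_containment_division(dividend, divisor):
    """Divide by every group of the divisor relation.

    dividend: iterable of (transaction, item) tuples.
    divisor:  iterable of (itemset, item) tuples; each itemset value
              defines one divisor group.
    Returns the set of (transaction, itemset) pairs for which all items
    of the itemset occur in the transaction.
    """
    transactions = defaultdict(set)
    for transaction, item in dividend:
        transactions[transaction].add(item)

    groups = defaultdict(set)
    for itemset, item in divisor:
        groups[itemset].add(item)

    # Union of the individual divisions, one per divisor group.
    return {(transaction, itemset)
            for itemset, items in groups.items()
            for transaction, t_items in transactions.items()
            if items <= t_items}
```

Grouping the result on the itemset value and counting the transactions per group would then give the support of every candidate in one pass.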

The rest of this paper is organized as follows. In Section 2, we discuss alter- 
native ways to store and process data in tables of a relational database system. 
Section 3 highlights important known approaches using SQL-92 before we in- 
troduce a new approach that makes use of a vertical table layout in Section 4. 
We compare the performance of several algorithms discussed before and assess 
a novel query processing technique, set containment division, for the new algo- 
rithm in Section 5. Section 6 concludes the paper. 
2 Table Layout

Before data can be mined with SQL, it has to be made available as relational 
tables. Typically, the data is stored in tables within that database system or, 
if the data has to be kept outside of the database system, it has to be made 
accessible to the database system by using wrappers that provide a relational 
view on the data. 

2.1 Layout Types 

Two types of data objects are relevant for frequent itemset discovery: transac- 
tions and itemsets. For each type, there are basically two main layouts (schemas) 
for representing these objects in a table. In particular, the items of an object 
can be stored either in a single row, which we call a horizontal layout, or in several
rows, which we call a vertical layout, as illustrated in Table 1. Note that the
vertical layout for itemsets has a position attribute pos associated with each
item. This is necessary because most algorithms assume a lexicographic order of 
items within an itemset and they need to access an item at a specific position. 



Table 1. Table layout alternatives for storing the items of transactions and
itemsets

Layout: horizontal (single-row/multi-column)

  Transactions                              Itemsets
  transaction  item1    item2    item3      itemset  item1  item2
  1001         diapers  beer     chips      101      beer   chips
  1002         chips    diapers  NULL       102      beer   diapers

Layout: vertical (multi-row/single-column)

  Transactions              Itemsets
  transaction  item         itemset  pos  item
  1001         diapers      101      1    beer
  1001         beer         101      2    chips
  1001         chips        102      1    beer
  1002         chips        102      2    diapers
  1002         diapers



Almost all known SQL-based approaches assume that transactions are stored 
in a vertical layout. To the best of our knowledge, only the approach proposed by 
Rajamani et al. [18] assumes a horizontal layout. In that approach, the horizontal/vertical
layout for transactions is called multi-column/single-column data
model, respectively. No layout alternatives for itemsets are discussed in that 
paper because the focus is on input data (i.e., only the transactions) for asso- 
ciation rule discovery algorithms, not on intermediate or result data (itemsets) 
representations, as in this paper. 

Analogous to transactions, there are two different table layouts for itemsets. 
All known approaches for frequent itemset discovery based on SQL-92 assume a 
horizontal itemset table layout. 
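
The relationship between the two layouts can be illustrated with a small conversion routine. This is a sketch only; the function name to_vertical is hypothetical, rows are plain Python tuples, and None stands in for NULL.

```python
def to_vertical(horizontal_rows):
    """Convert rows (id, item1, ..., itemN) into rows (id, pos, item).

    Unused item attributes (None, standing in for NULL) are skipped.
    Positions start at 1.
    """
    return [(row[0], pos, item)
            for row in horizontal_rows
            for pos, item in enumerate(row[1:], start=1)
            if item is not None]
```

Dropping the pos value from each resulting row would give the vertical layout used for transactions.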

A third, hybrid way of storing items is possible, combining the vertical and 
horizontal approaches. It may happen that the size of itemsets or, more likely, 
the size of some transactions is larger than the number of item attributes that 
have been defined for the tables of a horizontal layout. For example, if 99% of
all transactions to be stored in a database have up to ten items and only 1%
has more than ten items, then a database designer may decide that a horizontal
layout with ten item attributes is reasonable. For the few long transactions,
however, the remaining items can be stored in additional rows. For example,
if the transaction table layout has ten item attributes and we want to store
a transaction of size 33, then at least four rows are required to store the entire
transaction. Of course, all rows belonging to the same transaction/itemset must
have a common transaction/itemset ID, as with the vertical approach. We will 
not further discuss this approach in this paper. 

Further, rather exotic approaches have been proposed to represent transaction
data. For example, the layout described in [19], called decomposed storage
structure, is a non-relational data layout used in the database system Monet,
where the transactions are stored as follows. Let I denote the number of distinct
items in the transactions. The data are stored in I columns(!) where each column
contains the set of transaction IDs that contain the item. Hence, the frequent
1-itemsets are simply the columns that contain a sufficient number of transaction
IDs. Other layouts are used by SQL-based algorithms with object-relational
features like user-defined functions, as defined in SQL:1999. These layouts have
a column with a container data type (like BLOB or VARCHAR) to store lists of
objects. One example approach, called Vertical [10], uses a transaction table
layout Transaction(transaction, item_list), where item_list contains the items of
a transaction. A similar approach, called Horizontal [10], uses the layout
Transaction(item, transaction_list), similar to the decomposed structure described
above. For each distinct item, there is a list of all transaction IDs that contain
the item. However, the Horizontal approach uses a row for each item, instead of a
column as in the decomposed layout. In this paper, we restrict our discussion to
approaches based on SQL-92, i.e., we focus on the vertical and horizontal layout.
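
The decomposed structure can be mimicked with a dictionary that maps each item to the set of transaction IDs containing it. The sketch below is an in-memory illustration with hypothetical function names; it is not the storage format of Monet or of the Horizontal approach.

```python
from collections import defaultdict


def decompose(vertical_rows):
    """Build one 'column' per item: item -> set of transaction IDs."""
    columns = defaultdict(set)
    for transaction, item in vertical_rows:
        columns[item].add(transaction)
    return dict(columns)


def frequent_items(columns, min_count):
    """Frequent 1-itemsets are the columns with enough transaction IDs."""
    return sorted(item for item, tids in columns.items()
                  if len(tids) >= min_count)
```

With such a structure, the support of a larger itemset is the size of the intersection of the ID sets of its items.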

2.2 Vertical vs. Horizontal Layout 

In the following, we will use the term object to denote itemsets and transactions 
alike. The vertical approach differs from the horizontal approach in several ways, 
like the maximum object size and the number of tables and indexes used for the 
objects. 

Object Size. The size of an object does not need to be specified in the vertical 
layout. If we want to store extremely large objects using a horizontal layout, 
it could happen that the maximum number of table columns allowed in the 
database system in use is lower than the desired object size.^1 Not only the
storage of objects may be restricted in a horizontal layout but also the processing
of queries may cause problems. The number of attributes allowed in a SELECT
clause of SQL may also be lower than required by an SQL-based algorithm.

^1 For example, Microsoft SQL Server 2000 allows a maximum of 1,024 columns per
table. IBM DB2 Universal Database 7.2 allows up to 1,012 columns per table.
Therefore, we should also avoid very long
attribute lists in projections, i.e., the SQL-based approach should not produce
an intermediate result that has a horizontal layout inside the queries even if the
outcome of a query is in a vertical layout.

Number of Tables. Objects of different size can be stored in the same table 
if a vertical layout is used. We could even store all objects in a single table, i.e., 
the transaction table, provided that the data types of itemset IDs and transac- 
tion IDs are compatible, that the IDs of transactions and itemsets are unique, 
and that the position attribute values are set according to the lexicographical 
ordering of items for transactions. In a vertical approach, the support counters
of frequent itemsets can be kept in a separate table Support(itemset, count),
where the itemset value corresponds to that of the respective itemset table, for
example by means of a foreign key Support.itemset referencing table Itemset.
In a horizontal layout, the support counter can also be added as an additional
attribute to the Itemset table: (itemset, item_1, ..., item_n, count). It is reported
that the horizontal layout for transactions seems to allow faster algorithms [18].
However, the vertical layout is much more common for market basket analysis, 
the most popular field of application for association rule discovery. 

Number of Indexes. Fewer indexes come into consideration and are required 
to improve performance in the vertical layout. An itemset table in a vertical 
layout has three attributes. Hence, only 15 column combinations for indexes
are possible^2. The larger number of potential indexes for the horizontal layout
requires a more thorough analysis on which subset of indexes could actually be 
exploited by the queries given the current characteristics of data to be mined. 
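
The number 15 can be checked by enumeration, assuming that an index is defined by an ordered, non-empty list of distinct columns (the counting rule is our reading of the text). The function names below are hypothetical.

```python
from itertools import permutations
from math import factorial


def count_by_enumeration(n):
    """Count ordered, non-empty lists of distinct columns out of n."""
    return sum(1
               for length in range(1, n + 1)
               for _ in permutations(range(n), length))


def count_by_formula(n):
    """Same count in closed form: the sum of n!/k! for k = 0 .. n-1."""
    return sum(factorial(n) // factorial(k) for k in range(n))
```

For a horizontal itemset table with many item attributes the count grows very quickly, which is why the choice of indexes needs more care there.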

3 SQL-Based Algorithms 

There is a multitude of algorithms for frequent itemset discovery. Most ap- 
proaches do not consider the query functionality of a database system but merely 
its storage capability, if at all. Implementations of these approaches, including 
commercial mining systems, use a database system like a file system for retrieving 
input transaction data and in rare cases also for storing intermediate and result 
itemsets. The focus of most research on new algorithms lies on main-memory 
data structures that allow an efficient candidate generation and support count- 
ing phase. Such algorithms have to provide scalability in addition to the core 
functionality itself. In contrast, SQL-based approaches can rely on the query 
execution engine to handle a scalable processing of the queries. However, they 
often lack the efficiency of main-memory based approaches. 



^2 There are $n! \sum_{k=0}^{n-1} \frac{1}{k!}$ combinations for $n$ attributes. This is at least
exponential, since $2^n - 1 \le n! \sum_{k=0}^{n-1} \frac{1}{k!}$. Here, we do not take into account the types
of indexes, like clustered, secondary, bitmap, etc., but we focus on the attribute
combinations only.
Even the subclass of algorithms that use SQL queries is large. A couple of approaches
employ queries containing user-defined functions, which are processed
by an object-relational database system. Furthermore, several approaches do not 
employ any user-defined procedural code at all. Such algorithms use only queries 
that conform to the SQL-92 standard. In this paper, we focus on approaches 
based on SQL-92. 

3.1 Overview of Algorithms Based on SQL-92 

The SETM algorithm is the first SQL-based approach [20] described in the liter- 
ature. Several researchers have suggested improvements of SETM. For example, 
in [21] the use of views is suggested instead of some of the tables employed, as 
well as a reformulation using subqueries. The performance of SETM on a parallel 
DBMS has been studied further [22]. The results have shown that SETM does 
not perform well on large data sets and new approaches have been devised, like
for example K-Way-Join, Three-Way-Join, Subquery, and Two-Group-Bys [10].
These new algorithms differ only in the statements used for support counting.
They use the same SQL statement for generating $C_k$, as illustrated in Figure 2
for the example value $k = 4$. The statement creates a new candidate $k$-itemset by
exploiting the fact that all of its $k$ subsets of size $k - 1$ have to be frequent. This
condition is called Apriori property. It was originally introduced in the Apriori
algorithm [23,24]. Two frequent subsets are picked to construct a new candidate.
These itemsets must have the same items from position 1 up to $k - 2$. The new
candidate is further constructed by adding the $(k-1)$th items of both itemsets in a
lexicographically ascending order. In addition, the statement checks if the $k - 2$
remaining subsets of the new candidates are frequent as well.
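
The join and the Apriori check performed by this statement can be sketched in Python on a horizontal representation. This is an in-memory illustration with a hypothetical function name; itemsets are assumed to be tuples whose items are in lexicographic order.

```python
def generate_candidates(frequent_prev):
    """Build candidate k-itemsets from the set of frequent (k-1)-itemsets.

    frequent_prev: set of tuples of equal length k-1, each sorted.
    """
    candidates = set()
    for a1 in frequent_prev:
        for a2 in frequent_prev:
            # Join: same prefix of size k-2, last items in ascending order.
            if a1[:-1] == a2[:-1] and a1[-1] < a2[-1]:
                candidate = a1 + (a2[-1],)
                k = len(candidate)
                # Apriori check: the k-2 remaining subsets, obtained by
                # skipping one of the first k-2 items, must be frequent.
                if all(candidate[:p] + candidate[p + 1:] in frequent_prev
                       for p in range(k - 2)):
                    candidates.add(candidate)
    return candidates
```

Skipping one of the last two items yields a1 or a2 again, so only the first $k - 2$ positions need to be tested.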

The algorithms presented in [10] perform differently for different data characteristics.
Subquery is reported to be the best algorithm overall compared to
the other approaches based on SQL-92. The reason is that it exploits common
prefixes between candidate $k$-itemsets when counting the support.

Another approach presented in [10], called K-Way-Join, uses $k$ instances of
the transaction table and joins them with each other and with a single instance of
$C_k$. An example SQL statement of this approach is shown on the left of Figure 5,
where it is contrasted with an equivalent approach using a vertical layout on the
right.



INSERT INTO C4 (itemset, item1, item2, item3, item4)
SELECT newid(), item1, item2, item3, item4
FROM (
    SELECT a1.item1, a1.item2, a1.item3, a2.item3 AS item4
    FROM F3 AS a1, F3 AS a2, F3 AS a3, F3 AS a4
    WHERE a1.item1 = a2.item1 AND
          a1.item2 = a2.item2 AND
          a1.item3 < a2.item3 AND
          -- Apriori property.
          -- Skip item1.
          a3.item1 = a1.item2 AND
          a3.item2 = a1.item3 AND
          a3.item3 = a2.item3 AND
          -- Skip item2.
          a4.item1 = a1.item1 AND
          a4.item2 = a1.item3 AND
          a4.item3 = a2.item3) AS temporary;



Fig. 2. Generation of candidate 4-itemsets in SQL-92. Such a statement is used 
by all known algorithms that have a horizontal table layout 
More recently, an approach called Set-oriented Apriori has been proposed [25].
The authors argue that too much redundant computation is involved in each
support counting phase. They claim that it is beneficial to save the information
about which item combinations are contained in which transaction, i.e., Set-oriented
Apriori generates an additional table $T_k$(transaction, item_1, ..., item_k)
in the $k$th step of the algorithm. The algorithm derives the frequent itemsets by
grouping on the $k$ items of $T_k$ and it generates $T_{k+1}$ using $T_k$. Their performance
results have shown that Set-oriented Apriori performs better than Subquery,
especially for high values of $k$.

4 A New Approach Using Universal Quantification 

In this section, we suggest a new approach called Quiver (Quantified Itemset 
discovery using a VERtical table layout) for computing frequent itemsets using 
SQL. It requires a vertical table layout for computing candidate and frequent 
itemsets, as defined in Section 2. In addition, it employs universal and existential 
quantifications of tuple variables. In the following, we will discuss the two phases 
of frequent itemset discovery according to Quiver, candidate generation and 
support counting. 

4.1 Candidate Generation Phase 

Before we show how to accomplish the entire candidate generation in SQL, we 
explain the key ideas of the Quiver approach using the tuple relational calculus 
notation used in [26]. We do this because the calculus is more concise than SQL
and the universal quantification used in Quiver becomes apparent. 

Calculus Expressions. The generation of candidate $k$-itemsets can be expressed
in a single calculus expression Candidate(k). However, we have decomposed
this expression into several subexpressions for a clearer presentation.

In the following, we assume that we have computed $F_{k-1}$, the set of frequent
$(k-1)$-itemsets, for some $k \ge 2$ during the previous support counting phase. All
expressions use the tuple variables $a_1$, $a_2$, $a_3$, $b_1$, $b_2$, and $c$ referring to the same
relation $F_{k-1}$. All $k$-candidates have to fulfill the calculus query
$C_k = \{c \in F_{k-1} \mid \mathit{Candidate}(k)\}$. Note that the expressions shown below are actually
templates of expressions because they are parameterized. For example, the template
A(k, p), explained below, has the input parameter $k$, the size of the candidate
itemsets to be created, as well as the parameter $p$, the item position within an
itemset. However, in the rest of the paper we will use the term “expression” instead
of “expression templates” for simplicity.

The Candidate expression relies on two subexpressions, unique and PrefixPair.
The unique expression can be regarded as a function that creates a new
itemset ID that must be distinct from all existing IDs and that is guaranteed to 
be different from any ID that is returned for a different input value pair. When 
we use SQL for generating such a unique ID, we could simply use the current 
timestamp. 
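$$
\begin{aligned}
C_k ={}& \{\, c \in F_{k-1} \mid \mathit{Candidate}(k) \,\} \\[6pt]
\mathit{Candidate}(k) ={}& \exists a_1 \in F_{k-1}\ \exists a_2 \in F_{k-1}\ ( \\
& \quad [c.\mathit{itemset} = \mathit{unique}()] \wedge \\
& \quad [((c.\mathit{pos} = a_1.\mathit{pos}) \wedge (1 \le a_1.\mathit{pos} \le k-1) \wedge (c.\mathit{item} = a_1.\mathit{item})) \vee {} \\
& \quad \ ((c.\mathit{pos} = k) \wedge (a_2.\mathit{pos} = k-1) \wedge (c.\mathit{item} = a_2.\mathit{item}))] \wedge \\
& \quad \mathit{PrefixPair}(k)) \\[6pt]
\mathit{PrefixPair}(k) ={}& \forall b_1 \in F_{k-1}\ \forall b_2 \in F_{k-1}\ ( \\
& \quad [(b_1.\mathit{itemset} = a_1.\mathit{itemset}) \wedge (b_2.\mathit{itemset} = a_2.\mathit{itemset}) \wedge {} \\
& \quad \ (b_1.\mathit{pos} < k-1) \wedge (b_1.\mathit{pos} = b_2.\mathit{pos})] \rightarrow (b_1.\mathit{item} = b_2.\mathit{item})) \wedge \\
& \exists b_1 \in F_{k-1}\ \exists b_2 \in F_{k-1}\ ( \\
& \quad [(b_1.\mathit{itemset} = a_1.\mathit{itemset}) \wedge (b_2.\mathit{itemset} = a_2.\mathit{itemset}) \wedge {} \\
& \quad \ (b_1.\mathit{pos} = k-1) \wedge (b_1.\mathit{pos} = b_2.\mathit{pos})] \rightarrow (b_1.\mathit{item} < b_2.\mathit{item})) \wedge \\
& \mathit{Apriori}(k) \\[6pt]
\mathit{Apriori}(k) ={}& \bigwedge_{p=1}^{k-2} A(k,p) \\[6pt]
A(k,p) ={}& \exists a_3 \in F_{k-1}\ \forall b_1 \in F_{k-1}\ \forall b_2 \in F_{k-1}\ ( \\
& \quad [((b_1.\mathit{itemset} = a_1.\mathit{itemset}) \wedge (b_2.\mathit{itemset} = a_3.\mathit{itemset}) \wedge {} \\
& \quad \ (b_2.\mathit{pos} < p) \wedge (b_1.\mathit{pos} = b_2.\mathit{pos})) \rightarrow (b_1.\mathit{item} = b_2.\mathit{item})] \wedge \\
& \quad [((b_1.\mathit{itemset} = a_1.\mathit{itemset}) \wedge (b_2.\mathit{itemset} = a_3.\mathit{itemset}) \wedge {} \\
& \quad \ (p \le b_2.\mathit{pos} < k-1) \wedge (b_1.\mathit{pos} = b_2.\mathit{pos} + 1)) \rightarrow (b_1.\mathit{item} = b_2.\mathit{item})] \wedge \\
& \quad [((b_1.\mathit{itemset} = a_2.\mathit{itemset}) \wedge (b_2.\mathit{itemset} = a_3.\mathit{itemset}) \wedge {} \\
& \quad \ (b_2.\mathit{pos} = k-1) \wedge (b_1.\mathit{pos} = b_2.\mathit{pos})) \rightarrow (b_1.\mathit{item} = b_2.\mathit{item})])
\end{aligned}
$$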



The second subexpression of Candidate, called PrefixPair, finds itemset ID
pairs of frequent $(k-1)$-itemsets $(a_1, a_2)$ that have a common prefix of size
$k - 2$. Such an itemset pair has the same item value at each position from 1 to
$k - 2$, and the item value of the first itemset at position $k - 1$ is lexicographically
ordered before that of the second itemset. For example, we will create a new
itemset ABCD for $C_4$ if we find the itemsets ABC and ABD in $F_3$, which have
the common prefix AB.

The PrefixPair calculus expression contains universal quantifications and logical
implications. An implication of the form $f \rightarrow g$ expresses the fact that if $f$
holds then $g$ must hold, too. For example, we can phrase the for-all expression
in PrefixPair as follows: “For all item combinations $(b_1.\mathit{item}, b_2.\mathit{item})$ of itemsets
$a_1$ and $a_2$, if we look at the same position of any but the last item of both
itemsets, then they must have the same item value at this position.” The existential
expression that follows the universal expression can be phrased as: “At
position $k - 1$, the item value of the first itemset that we aim to find has to be
lexicographically less than that of the second itemset.”

The Quiver approach tries to reduce the number of candidates by ignoring
all itemsets that do not fulfill the Apriori property, as in the K-Way-Join
algorithm described in Section 3.1. Thus, PrefixPair contains another
expression called Apriori(k), which has to hold as well. This expression checks
if, apart from $a_1$ and $a_2$, which we know are frequent, all other subsets of
size $k-1$ are frequent, too. For example, for the potential candidate ABCD,
Apriori(4) checks if the subsets ACD and BCD are contained in $F_3$. We already
know that ABC and ABD have to be frequent because we used them for the
construction of the potential candidate. Hence, we can ignore these checks in
Apriori(4). In general, Apriori(k) tests the existence in $F_{k-1}$ of those
$(k-1)$-itemsets that we get when skipping a single item at the positions 1 to
$k-2$ of a potential candidate $k$-itemset. For candidate itemset ABCD, for
example, we skip position 1 to get BCD and position 2 to get ACD. This Apriori
check is represented by a conjunction of $k-2$ similar expressions $A(k,p)$, each
having a different value of the position parameter $p$, where $1 \le p \le k-2$, i.e.,
$\mathit{Apriori}(k) = A(k,1) \wedge A(k,2) \wedge \ldots \wedge A(k,k-2)$. Each expression
$A(k,p)$ checks whether the $(k-1)$-itemset that we get when we skip the item at
position $p$ of the potential candidate $k$-itemset is frequent.
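As a hedged illustration of the Apriori check, the following Python sketch tests only the subsets obtained by skipping positions 1 to k-2. Itemsets are again assumed to be sorted tuples, and `apriori_check` is a hypothetical helper name, not part of the Quiver SQL statements.

```python
def apriori_check(candidate, frequent):
    """Return True if every (k-1)-subset obtained by skipping one of the
    positions 1 .. k-2 of the candidate k-itemset is in `frequent`.

    The subsets obtained by skipping position k-1 or k are the two itemsets
    used to build the candidate, so they are deliberately not re-checked.
    """
    k = len(candidate)
    for p in range(1, k - 1):                       # p = 1 .. k-2 (1-based)
        subset = candidate[:p - 1] + candidate[p:]  # drop the item at position p
        if subset not in frequent:
            return False
    return True


# Example data (an assumption for illustration).
f3 = {("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"), ("B", "C", "D")}
abcd_ok = apriori_check(("A", "B", "C", "D"), f3)
abcd_without_bcd = apriori_check(("A", "B", "C", "D"), f3 - {("B", "C", "D")})
```

For ABCD the loop inspects BCD (p = 1) and ACD (p = 2), which corresponds to the conjunction A(4,1) and A(4,2).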

From Calculus to SQL. We can easily derive SQL statements from the tuple
relational calculus expressions specified in the previous section. For example, an
implication can be replaced by a disjunction, i.e., we transform
$f \rightarrow g \equiv \neg f \vee g$ into “NOT f OR g.” Unfortunately, there is no
universal quantifier available in SQL. Therefore, we translate
$(\forall x \in T : f(x)) \equiv \neg(\exists x \in T : \neg f(x))$ into “NOT
EXISTS (SELECT * FROM T AS x WHERE NOT f(x)).” In addition, we can
use De Morgan’s rule for pushing a negation into a conjunction or a disjunction,
for example $\neg(f \wedge g) \equiv \neg f \vee \neg g$.
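The translation rule for the universal quantifier can be tried out with a small Python sketch that uses the standard `sqlite3` module. This is an illustration under assumptions: SQLite stands in for the commercial DBMS, and the table `T`, its column `v`, and the helper `holds_for_all` are hypothetical names.

```python
import sqlite3


def holds_for_all(conn, table, predicate_sql):
    """Evaluate (for all x in table: predicate) as NOT EXISTS (... WHERE NOT predicate).

    `predicate_sql` is trusted, illustrative SQL text that refers to the alias x.
    """
    query = (
        f"SELECT NOT EXISTS ("
        f"SELECT * FROM {table} AS x WHERE NOT ({predicate_sql}))"
    )
    return bool(conn.execute(query).fetchone()[0])


def implies(f, g):
    """Implication f -> g rewritten as (NOT f) OR g."""
    return (not f) or g


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (v INTEGER)")
conn.executemany("INSERT INTO T (v) VALUES (?)", [(1,), (2,), (3,)])

all_positive = holds_for_all(conn, "T", "x.v > 0")    # every row satisfies it
all_above_one = holds_for_all(conn, "T", "x.v > 1")   # the row v = 1 violates it
```

The helper returns the same truth value as Python's `all(...)` over the rows, which is the equivalence the text relies on.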

Figure 3 shows the SQL-92 statements of the candidate generation phase in
Quiver that are equivalent to the tuple relational calculus given above. While the
calculus expressions to compute $C_k$ are generic, the SQL statements are shown
for the example value $k = 3$.

The first SQL query shown in Figure 3 populates the prefix-pair table
P(itemset_new, itemset_old1, itemset_old2), where itemset_new is a newly created
unique ID and the other attributes belong to the ID pair of frequent
$(k-1)$-itemsets with a common $(k-2)$-prefix. In this query, we have merged the ID
generation with the computation of prefix pairs, i.e., the calculus expression
Unique is translated into the newid() function returning a unique identifier
(offered by Microsoft SQL Server 2000, for example).

The second statement derives the candidate $k$-itemsets. For each record in
the table P, we copy each item and its corresponding pos value belonging
to itemset_old1 into the target table $C_k$(itemset, pos, item), together with the
newly created itemset ID value itemset_new. In addition, we add another record
(ID, k, item) to $C_k$, where the item value of itemset_old2 is taken from position
$k-1$. This procedure is illustrated with example data in Figure 4, where a single
candidate 4-itemset is generated.

4.2 Support Counting Phase 

We have seen how universal quantification is used to generate Ck in the Quiver 
approach. These candidates now have to be checked if they appear frequently 







DECLARE @k INT;
SET @k = 3;

INSERT INTO P (itemset_new, itemset_old1, itemset_old2)
SELECT newid(), itemset1, itemset2
FROM (
  SELECT DISTINCT a1.itemset AS itemset1, a2.itemset AS itemset2
  FROM F2 AS a1, F2 AS a2
  WHERE NOT EXISTS (
      -- No position before k-1 at which a1 and a2 differ.
      SELECT * FROM F2 AS b1, F2 AS b2
      WHERE ((b1.itemset = a1.itemset) AND (b2.itemset = a2.itemset) AND
             (b1.pos < @k-1) AND (b1.pos = b2.pos)) AND NOT (b2.item = b1.item)
    ) AND
    EXISTS (
      -- At position k-1, the item of a1 is ordered before the item of a2.
      SELECT b1.item, b2.item
      FROM F2 AS b1, F2 AS b2
      WHERE (b1.itemset = a1.itemset) AND (b2.itemset = a2.itemset) AND
            (b1.pos = @k-1) AND (b1.pos = b2.pos) AND (b1.item < b2.item)
    )
    AND
    -- In the following, we skip the item at position p of itemset a1.
    -- The following EXISTS clause has to be added for each value of p,
    -- where 1 <= p <= k-2.
    -- Skip item at position p = 1.
    EXISTS (
      SELECT a3.itemset
      FROM F2 AS a3
      WHERE NOT EXISTS (
        SELECT b1.item, b2.item
        FROM F2 AS b1, F2 AS b2
        WHERE NOT (
          -- Condition 1: 1 <= i < p
          ( NOT (
              (b1.itemset = a1.itemset) AND (b2.itemset = a3.itemset) AND
              (1 <= b2.pos) AND (b2.pos < 1) AND (b1.pos = b2.pos)
            ) OR
            (b1.item = b2.item)
          ) AND
          -- Condition 2: p <= i < k-1
          ( NOT (
              (b1.itemset = a1.itemset) AND (b2.itemset = a3.itemset) AND
              (1 <= b2.pos) AND (b2.pos < @k-1) AND (b1.pos = b2.pos + 1)
            ) OR
            (b1.item = b2.item)
          ) AND
          -- Condition 3: i = k-1
          ( NOT (
              (b1.itemset = a2.itemset) AND (b2.itemset = a3.itemset) AND
              (b2.pos = @k-1) AND (b1.pos = b2.pos)
            ) OR
            (b1.item = b2.item)
          )
        )
      )
    )
) AS PrefixPair;

INSERT INTO C3 (itemset, pos, item)
SELECT p.itemset_new, f.pos, f.item
FROM F2 AS f, P AS p
WHERE f.itemset = p.itemset_old1
UNION
SELECT p.itemset_new, @k, f.item
FROM F2 AS f, P AS p
WHERE f.itemset = p.itemset_old2 AND
      f.pos = @k-1;



Fig. 3. Candidate generation with SQL according to the Quiver approach 
Fig. 4. Example data for candidate generation using the Quiver approach

enough in the transactions to qualify for $F_k$. We propose a new approach for
support counting that uses universal quantification as well.

Before we discuss the new approach, we show a vertical approach for support
counting that is equivalent to the horizontal approach K-Way-Join, described in
INSERT INTO S3 (itemset, support)
SELECT c.itemset, COUNT(*)
FROM C3 AS c, T AS t1, T AS t2, T AS t3
WHERE c.item1 = t1.item AND
      c.item2 = t2.item AND
      c.item3 = t3.item AND
      t1.transaction = t2.transaction AND
      t1.transaction = t3.transaction
GROUP BY c.itemset
HAVING COUNT(*) >= @minimum_support;

INSERT INTO F3 (itemset, item1, item2, item3)
SELECT c.itemset, c.item1, c.item2, c.item3
FROM C3 AS c, S3 AS s
WHERE c.itemset = s.itemset;

(a) Original, horizontal version of K-Way-Join



Fig. 5. The support counting phase in the SQL algorithms K-Way-Join and
Vertical K-Way-Join. This example derives $F_3$, the set of frequent 3-itemsets

Section 3.1. The direct translation into the vertical layout is given in Figure 5.
We show this mapping because the Quiver approach for support counting is
similar to the vertical version of K-Way-Join: We replace the explicit check of
an item value at each of the $k$ positions (pos values 1, 2, and 3 in Figure 5) with
a universal quantification that checks all positions available in the candidate
$k$-itemset. The figure also shows the intermediate result table $S_k$ containing only
the support count information for each itemset. The final result is derived
by joining $C_k$ with $S_k$. If we are not interested in storing the support counters
in a separate table, then the statement for deriving $S_k$ can be merged with the
second query, which computes $F_k$.

Calculus Expressions. The new approach for support counting is defined in
tuple relational calculus as follows:

$$\mathit{Query} = \{\, c_1 \in C,\ t_1 \in T \mid \mathit{Contains} \,\}$$

$$\begin{aligned}
\mathit{Contains} = {} & \forall c_2 \in C\ \exists t_2 \in T\ ( \\
& (c_2.\mathit{itemset} = c_1.\mathit{itemset}) \rightarrow {} \\
& ((t_2.\mathit{transaction} = t_1.\mathit{transaction}) \wedge (t_2.\mathit{item} = c_2.\mathit{item})))
\end{aligned}$$

The expression Contains derives combinations of transactions and candidates.
It has two free tuple variables $c_1$ and $t_1$, where $c_1$ represents a candidate
itemset and $t_1$ is a transaction that contains the itemset. The quantified
(bound) tuple variables $c_2$ and $t_2$ represent the items corresponding to $c_1$ and
$t_1$, respectively. The universal quantification lies in the condition that for
each item $c_2.\mathit{item}$ belonging to itemset $c_1.\mathit{itemset}$, there must be an
item $t_2.\mathit{item}$ belonging to transaction $t_1.\mathit{transaction}$ that matches $c_2$.

A combination $(c_1, t_1)$ fulfilling the calculus query
$\{c_1 \in C, t_1 \in T \mid \mathit{Contains}\}$ indicates that the itemset
$c_1.\mathit{itemset}$ is contained in the transaction $t_1.\mathit{transaction}$. We can
find the support of each candidate by counting the number of distinct values
$t_1.\mathit{transaction}$ that appear in a combination with $c_1.\mathit{itemset}$. We do
not show the actual counting because the basic tuple relational calculus does
not include aggregate functions.
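The Contains condition and the subsequent counting can be sketched in plain Python. This is only an illustration under assumptions: candidates are given as (itemset, pos, item) rows and transactions as (transaction, item) rows, and the function names `contains_pairs` and `support_counts` are hypothetical.

```python
from collections import defaultdict


def contains_pairs(candidate_rows, transaction_rows):
    """Return all (itemset ID, transaction ID) pairs such that every item
    of the itemset also occurs in the transaction."""
    items_of_itemset = defaultdict(set)
    for itemset, _pos, item in candidate_rows:
        items_of_itemset[itemset].add(item)
    items_of_transaction = defaultdict(set)
    for transaction, item in transaction_rows:
        items_of_transaction[transaction].add(item)
    return {
        (itemset, transaction)
        for itemset, c_items in items_of_itemset.items()
        for transaction, t_items in items_of_transaction.items()
        if all(item in t_items for item in c_items)   # the for-all condition
    }


def support_counts(candidate_rows, transaction_rows):
    """Count the distinct transactions that contain each candidate itemset."""
    counts = defaultdict(int)
    for itemset, _transaction in contains_pairs(candidate_rows, transaction_rows):
        counts[itemset] += 1
    return dict(counts)


# Example data (an assumption): itemset 1 = AB, itemset 2 = BC.
C = [(1, 1, "A"), (1, 2, "B"), (2, 1, "B"), (2, 2, "C")]
T = [(10, "A"), (10, "B"), (10, "C"), (20, "A"), (20, "B"), (30, "C")]
supports = support_counts(C, T)
```

Because `contains_pairs` returns a set, each (itemset, transaction) combination is counted once. Candidates that are contained in no transaction do not appear in the result.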



INSERT INTO S3 (itemset, support)
SELECT c1.itemset, COUNT(*)
FROM C3 AS c1, C3 AS c2, C3 AS c3,
     T AS t1, T AS t2, T AS t3
WHERE c1.itemset = c2.itemset AND
      c1.itemset = c3.itemset AND
      t1.transaction = t2.transaction AND
      t1.transaction = t3.transaction AND
      c1.item = t1.item AND
      c2.item = t2.item AND
      c3.item = t3.item AND
      c1.pos = 1 AND
      c2.pos = 2 AND
      c3.pos = 3
GROUP BY c1.itemset
HAVING COUNT(*) >= @minimum_support;

INSERT INTO F3 (itemset, pos, item)
SELECT c.itemset, c.pos, c.item
FROM C3 AS c, S3 AS s
WHERE c.itemset = s.itemset;

(b) Vertical version of K-Way-Join
From Calculus to SQL. The calculus query can be translated into SQL in the
same way as explained in Section 4.1. It is important to note that the resulting
query in Figure 6 applies the aggregation on the set of unique transaction IDs
because duplicates can occur as a result of the query processing.



INSERT INTO S (itemset, support)
SELECT itemset, COUNT(DISTINCT transaction) AS support
FROM (
  SELECT c1.itemset, t1.transaction
  FROM C AS c1, T AS t1
  WHERE NOT EXISTS (
    SELECT *
    FROM C AS c2
    WHERE NOT EXISTS (
      SELECT *
      FROM T AS t2
      WHERE NOT (c1.itemset = c2.itemset) OR
            (t2.transaction = t1.transaction AND
             t2.item = c2.item)))
) AS Contains
GROUP BY itemset
HAVING COUNT(DISTINCT transaction) >= @minimum_support;

INSERT INTO F (itemset, pos, item)
SELECT c.itemset, c.pos, c.item
FROM C AS c, S AS s
WHERE c.itemset = s.itemset;



Fig. 6. Support counting phase in Quiver as a SQL query. The parameter k for 
the candidate table Ck is omitted, i.e., this query is the same for every iteration 
of the algorithm 
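To make the double NOT EXISTS formulation tangible, the following Python sketch runs a query of the same shape on an in-memory SQLite database. Several things here are assumptions for illustration: SQLite replaces the commercial DBMS, the transaction column is renamed to `tid` because TRANSACTION is a keyword in SQLite, and `count_support` is a hypothetical helper name.

```python
import sqlite3

SUPPORT_QUERY = """
SELECT itemset, COUNT(DISTINCT tid) AS support
FROM (
  SELECT c1.itemset AS itemset, t1.tid AS tid
  FROM C AS c1, T AS t1
  WHERE NOT EXISTS (
    SELECT * FROM C AS c2
    WHERE NOT EXISTS (
      SELECT * FROM T AS t2
      WHERE NOT (c1.itemset = c2.itemset)
         OR (t2.tid = t1.tid AND t2.item = c2.item)))
) AS Contains
GROUP BY itemset
HAVING COUNT(DISTINCT tid) >= ?
ORDER BY itemset
"""


def count_support(candidate_rows, transaction_rows, minimum_support):
    """Load C and T into SQLite and return (itemset, support) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE C (itemset INTEGER, pos INTEGER, item TEXT)")
    conn.execute("CREATE TABLE T (tid INTEGER, item TEXT)")
    conn.executemany("INSERT INTO C VALUES (?, ?, ?)", candidate_rows)
    conn.executemany("INSERT INTO T VALUES (?, ?)", transaction_rows)
    rows = conn.execute(SUPPORT_QUERY, (minimum_support,)).fetchall()
    conn.close()
    return rows


# Example data (an assumption): itemset 1 = AB, itemset 2 = BC.
candidates = [(1, 1, "A"), (1, 2, "B"), (2, 1, "B"), (2, 2, "C")]
transactions = [(10, "A"), (10, "B"), (10, "C"), (20, "A"), (20, "B"), (30, "C")]
frequent = count_support(candidates, transactions, 2)
```

The inner query returns one row per matching pair of candidate row and transaction row, so the same (itemset, tid) combination appears several times; this is why the aggregation counts distinct transaction IDs.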



5 Performance Experiments 

We first present the results of several experiments using SQL-based algorithms on 
a commercial database system. It is no surprise that they revealed a performance 
of the Quiver approach that is sometimes several orders of magnitude worse than 
the fastest known SQL-based approaches. This discouraging result is due to the 
absence of operators realizing efficient set containment tests of Quiver as well as 
an adequate syntax in SQL to express these tests. 

We then discuss experiments using our own implementation of query execution
plans (QEPs), built with a Java class library for constructing query processors.
They demonstrate that a QEP using a set containment division operator can result
in higher performance than equivalent plans derived from the query optimizer of a
commercial DBMS, which does not offer such an operator.

5.1 Experiments with a Commercial DBMS 

We compared the Quiver approach to K-Way-Join, Subquery, and Set-oriented
Apriori, discussed in Section 3.1, as well as to a vertical version of
K-Way-Join, proposed in Section 4.2. Remember that Quiver and Vertical K-Way-Join
use a vertical table layout for both itemsets and transactions, while the other
approaches use a horizontal table layout for itemsets and a vertical layout for
transactions.

The commercial DBMS used for the experiments was Microsoft SQL Server
2000 Standard Edition running on a 4-CPU Intel Pentium III Xeon PC with 900
MHz, 4 GB main memory, and Windows 2000 Server.

The synthetic data sets used for the experiments have been produced using
a Java implementation of the well-known IBM data generation tool [23]. Similar
data sets have been used in numerous publications for association rule discovery
algorithms. By convention, our data sets are called T5.I5.D10k and T5.I5.D100k,
respectively. The data set T5.I5.D10k/T5.I5.D100k consists of 10,000/100,000
transactions (D), each containing 5.896/5.445 items (T) on average, resulting
in 58,964/544,590 rows. The largest size of a transaction is 17/12 items.

Several indexes have been created on the intermediate and result tables of the
algorithms, summarized in Table 2. Although it is possible to experiment with
all attribute combinations for indexing tables of the vertical layout like P (the
equivalent of the calculus expression PrefixPair in Section 4.1), it requires much
effort to do so for large values of $k$ for horizontal tables like $C_k$. Hence, all
we can say is that our index choice is one out of many. However, we have made a
careful analysis of which indexes to offer to the optimizer. Note that the
performance results reported in [10] on algorithms based on SQL-92, mentioned in
Section 3.1, are very detailed, but they do not cover all possible indexes either.
Hence, we leave as future work a comprehensive study on selecting the most
promising indexes for the algorithms.

For each algorithm, we give the tables used by the algorithm: K- Way-Join 
(^^ C^, Si Fj}), Subquery (T^ TJ, Q/, Sk, Fj}), Set-oriented Apriori (T", 
TJ, Tl Ql Cl Sk, Fj}), Vertical K-Way-Join P, Cl Sk, F^), and Quiver 
(T’', P, Cl Sk, F^). The attributes of each table are also listed in Table 2. Tables 
storing itemsets in a horizontal/vertical layout are marked by superscript h/v, 
respectively. 

Figure 7 shows the results of our experiments that we briefly summarize in 
the following. 

Candidate Itemset Generation. Experiments with both data sets have shown 
that the candidate generation phase based on a vertical table layout (Quiver and 



Table 2. Overview of indexes created on tables used for SQL-based algorithms

Layout     | Table                                       | PI: Primary/clustered index           | SI: Secondary indexes
-----------+---------------------------------------------+---------------------------------------+------------------------------------------------------
vertical   | (transaction, item)                         | (transaction, item)                   | (transaction), (item), (item, transaction)
vertical   | TJ (transaction, item)                      | (transaction, item)                   | (transaction), (item), (item, transaction)
horizontal | (transaction, item_1, ..., item_k)          | (transaction, item_1, ..., item_k)    | (transaction), (item_1, ..., item_k)
horizontal | (itemset, item_1, ..., item_k)              | (itemset, item_1, ..., item_k)        | (itemset), (item_1), ..., (item_k)
horizontal | C_k (itemset, item_1, ..., item_k)          | (itemset)                             | (item_1), ..., (item_k)
horizontal | (itemset, item_1, ..., item_k)              | (itemset)                             | (item_1), ..., (item_k)
vertical   | C_k (itemset, pos, item)                    | (itemset)                             | (item), (pos)
vertical   | (itemset, pos, item)                        | (itemset, item)                       | (itemset), (item), (pos)
-          | S_k (itemset, support)                      | (itemset)                             | (support), (itemset, support), (support, itemset)
-          | P (itemset_new, itemset_old1, itemset_old2) | (itemset_new)                         | (itemset_old1), (itemset_old2)
(b) Candidate itemsets generation phase

(c) Frequent itemsets counting phase

Fig. 7. Results of performance experiments with a commercial DBMS. The data
sets are T5.I5.D10k, minimum support of 1% (100 transactions) on the left and
T5.I5.D100k, minimum support of 0.1% (100 transactions) on the right. Note
the logarithmic scale of the y-axes

Vertical K-Way-Join) was slower than the horizontal approach. For some values
of $k$, the response time differed by more than two orders of magnitude. This
difference is mainly caused by the expensive processing of the numerous
correlated (NOT) EXISTS subqueries that compute the table P, shown in Figure 3.
This query has a growing complexity for increasing itemset cardinalities. Due
to space restrictions, we cannot elaborate here on the different characteristics of
the QEPs produced by the optimizer of Microsoft SQL Server.

Frequent Itemset Counting. The results show that for the smaller data set 
(T5.I5.D10k), K-Way-Join is superior to all approaches in all but the last it- 
eration of the frequent itemset counting phase. All algorithms but Quiver had 
a performance that stayed within a corridor of at most an order of magnitude 
from each other. Quiver was far from an acceptable response time. 

The experiment with the larger data set (T5.I5.D100k) has confirmed the claim
in [25] that Set-oriented Apriori performs better than Subquery for late passes
of the algorithm, i.e., for high values of $k$. We do not give the response time for
Quiver on the right of Figure 7(c). The experiment ran for an unacceptably
long time and we had to stop its execution.

These experiments have demonstrated that universal quantification is not well 
supported by modern commercial database systems. We have performed similar 
experiments with another commercial DBMS but we cannot report on the details 
here. However, our observation has been confirmed by the other experiments. 

5.2 Experiments with XXL 

The aim of our research is to investigate if Quiver queries can be executed 
more efficiently in a database system that provides implementations of a set 
containment division or set containment join operator. We have implemented 
algorithms for these operators in a testbed based on the open source Java class 
library XXL [27], designed for building query processors. 

We have reimplemented the QEP for a certain Quiver query chosen by SQL
Server in our prototype and recreated the data set and indexes used by the
DBMS. The QEP realized the query deriving $S_4$ (Figure 6) from the data set
T5.I5.D10k with a minimum support of 1%. In addition, we have reimplemented
the original, horizontal version of K-Way-Join for the same problem and data
setting. We compared the execution time for these QEPs to an alternative QEP
that employs a set containment division operator. Due to lack of space we cannot
show the QEPs here but refer to [17] for a more thorough discussion of the
implementations.

Figure 8 shows that for a small number of candidates, the QEP employing
set containment division performed close to or better than the other plans.

The experiments with XXL have been conducted on a 4-CPU Sun Ultra
80 server with Sun UltraSPARC-II 450 MHz CPUs, 4 GB main memory, Sun
Solaris, and JRE 1.4.1.



6 Conclusion 



In this paper, we have investigated the problem of frequent itemset discovery and 
compared several solutions based on SQL-92. The discovery of frequent itemsets 
is generally composed of two phases: candidate generation and itemset counting. 
Fig. 8. Results of performance experiments with query execution plans
implemented using XXL. The number of candidate itemsets is 4 on the left and 32 on
the right. Note the logarithmic scale of the y-axes



Itemset counting can be regarded as a relational division problem. Several algo- 
rithms are available that allow a database system to efficiently process division 
within a query. 

Today, commercial database systems have no implementation of the division 
operator, which is partially because it is hard to detect a division inside a query 
and there is no keyword available in any SQL implementation that would fa- 
cilitate the specification of a division. Furthermore, division is not considered a 
frequently occurring problem in query processing [26] and it can be circumvented 
by indirect query formulations. 

There are two main approaches to represent transactions and itemsets in 
SQL-92: a horizontal table layout, where all items of an object are stored in a 
single row, and a vertical table layout, where an object spans as many rows as its 
number of items. We have presented a new approach called Quiver that employs 
a vertical table layout for both transactions and itemsets. All known algorithms 
based on SQL-92 use a horizontal table layout for itemsets. Because of the ver- 
tical table layout, the queries for both phases of frequent itemset discovery can 
employ universal quantification, as shown in our Quiver approach. We
investigated such an approach using for-all quantifiers because it allows a
natural formulation of the frequent itemset discovery problem: counting
the number of transactions that contain all elements of a given itemset.

Ongoing and future work is devoted to realizing efficient algorithms for set
containment division and set containment join operators. The latter is based on
set-valued attributes, i.e., a nested, non-1NF representation of data. As
mentioned in Section 1.2, we currently believe that a nested representation cannot
add substantial performance gains over an unnested representation. The crucial
containment test problem has to be solved either way and some effort has to go
into initially transforming data into a nested representation, if it is not grouped
or even sorted already.

Even if Quiver does not prove to be an appropriate solution for real data
mining applications, we believe that similar approaches based on a natural
representation of the mining problem in SQL will narrow the gap between data
mining algorithms and database systems.
Abstract. Mining Frequent Itemsets is the core operation of many data
mining algorithms. This operation, however, is very data intensive and
sometimes produces a prohibitively large output. In this paper we give
a complete set of rules for deducing tight bounds on the support of an
itemset if the supports of all its subsets are known. Based on the derived
bounds $[l, u]$ on the support of a candidate itemset $I$, we can decide
not to access the database to count the support of $I$ if $l$ is larger than
the support threshold ($I$ will certainly be frequent), or if $u$ is below the
threshold ($I$ will certainly fail the frequency test). We can also use the
deduction rules to reduce the size of an adequate representation of the
collection of frequent sets; all itemsets $I$ with bounds $[l, u]$, where $l = u$,
do not need to be stored explicitly. To assess the usability in practice, we
implemented the deduction rules and we present experiments on real-life
data sets.



1 Introduction 

Mining frequent itemsets is a core operation in many data mining problems. 
Since their introduction [1], many algorithms have been proposed to find frequent 
itemsets, especially in the context of association rule mining [1,2,12]. 

The frequent itemset problem is stated as follows. Assume we have a finite set
of items $\mathcal{I}$. A transaction is a subset of $\mathcal{I}$, together with a unique
identifier. A transaction database $\mathcal{D}$ is a finite set of transactions. A subset
of $\mathcal{I}$ is called an itemset. We say that an itemset $I$ is $s$-frequent in a
transaction database $\mathcal{D}$ if the number of transactions in $\mathcal{D}$ that contain
all items of $I$ is at least $s$. The number of transactions that contain all items
of $I$ is called the absolute support of $I$. The frequent itemset problem is, given a
support threshold $s$ and a transaction database $\mathcal{D}$, find all $s$-frequent
itemsets. In the remainder of the paper, we will always assume that we are
working over a transaction database $\mathcal{D}$ with items in $\mathcal{I}$.

All algorithms for mining frequent itemsets rely heavily on the following 
monotonicity principle [16] to prune the search space: 

    Let $J \subseteq I$ be two itemsets. In every transaction database $\mathcal{D}$, the
    support of $I$ will be at most as high as the support of $J$.

Thus, based on the support of a set that is below the support threshold, we
can deduce, using the monotonicity rule, that also the support of its supersets will
be below the threshold. This simple rule of deduction has successfully been used
in practice. Because of the success of this simple rule, much more attention went
into efficient counting schemes than into finding additional ways to prune the
search space. The standard example of an algorithm exploiting this monotonicity
is the well-known Apriori-algorithm [2]. Apriori traverses the itemset-lattice level
by level; in the $i$-th loop, itemsets of cardinality $i$ are counted in the database.
Because of the monotonicity principle, all itemsets in loop $i$ that have at least
one subset that failed the support-test can be pruned; we know a priori that
they will be infrequent. In this way we will never count itemsets that could be
pruned using the monotonicity rule.

In this paper we present deduction rules, additional to the monotonicity rule,
that calculate lower and upper bounds on the support of a candidate. As such, we
continue work initiated in [9]. Based on the supports of all subsets of an itemset
$I$, the deduction rules we present will compute bounds $[l, u]$ on the support of $I$.
We show that the rules calculate the best possible such bounds; that is, both $l$
and $u$ are possible as supports of $I$, and thus, the interval cannot be made more
tight. Based on these bounds we can limit the number of candidates we need to
count. For example, if $l$ is above the support threshold, then we know without
counting its support in the database that $I$ is frequent. If there is no need to
know the support of $I$ exactly, we can thus, in this case, omit counting $I$. If $u$ is
below the threshold, then we know for sure that $I$ is not frequent, and we can
prune it.

Besides reducing the number of candidate itemsets, we can also use the
deduction rules to make concise representations [15] of the frequent itemsets. We
call an itemset derivable if its lower and upper bound are the same. Thus, an
itemset is derivable if its support is uniquely determined by the supports of
its subsets. Therefore, for the derivable itemsets, it is not necessary to count
their supports. There is also no need to store them; we can later always find
the missing supports with the deduction rules. Based on this observation, the
NDI-representation is defined. We briefly discuss relations with other concise
representations in the literature, including free sets [5], closed sets [18,4,19], and
disjunction-free sets [6].

The organization of the paper is as follows. In Section 2 we give an example 
showing that the monotonicity rule is not complete for the deduction of supports. 
This example also gives a sketch of the general approach we follow to derive the 
deduction rules. In Section 3 we formally define important notions we will use 
throughout the paper. In Section 4, the deduction rules are given, and it is 
proven that they are complete. In Section 5 we present a concise representation 
based on the deduction rules. Section 6 gives the results of experiments with the 
deduction rules. In Section 7 we discuss related work and Section 8 concludes 
the paper. 
2 Motivating Example

Apriori does not prune perfectly. Consider the following database.

    TID | Items
    ----+------
     1  | A, B
     2  | A, C                                                        (1)
     3  | B, C

Suppose we are running the Apriori-algorithm on this database $\mathcal{D}$ with
minimal absolute support equal to 1. Apriori starts with counting the supports of
the singleton-itemsets in $C_1 = \{\{A\}, \{B\}, \{C\}\}$. Since they are all frequent, in its
second loop, Apriori will consider the candidates in
$C_2 = \{\{A, B\}, \{A, C\}, \{B, C\}\}$. Again all candidates are frequent, and thus,
Apriori counts $C_3 = \{\{A, B, C\}\}$ in its third loop. However, the following
observation shows that from the supports counted so far, we can derive that
$\{A, B, C\}$ must be infrequent.

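For each itemset $I$, let $\mathcal{T}_I(\mathcal{D})$ denote the set of transactions

$$\mathcal{T}_I(\mathcal{D}) =_{\mathrm{def}} \{(\mathit{tid}, I') \in \mathcal{D} \mid I' = I\},$$

and let $f_I$ be the cardinality of $\mathcal{T}_I(\mathcal{D})$. Hence, in the database
$\mathcal{D}$ given in (1), $\mathcal{T}_{AB}(\mathcal{D}) = \{(1, AB)\}$,
$\mathcal{T}_{AC}(\mathcal{D}) = \{(2, AC)\}$, and $\mathcal{T}_{BC}(\mathcal{D}) = \{(3, BC)\}$. For all
other itemsets $I$, $\mathcal{T}_I(\mathcal{D})$ is empty. Notice that
$\{\mathcal{T}_I(\mathcal{D}) \mid I \subseteq \mathcal{I}\}$ forms a partition
of $\mathcal{D}$. This partition is illustrated in Fig. 1. The dots in this figure
represent the transactions of $\mathcal{D}$. With every item a set is associated. The
set associated with item $A$ consists of all transactions that contain $A$. The
partition defined by the sets $\mathcal{T}_I(\mathcal{D})$ is indicated in the figure.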




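The next lemma expresses the supports of the itemsets as a function of the
numbers $f_I$, $I \subseteq \mathcal{I}$.

Lemma 1. For each itemset $I$,

$$\mathit{support}(I, \mathcal{D}) = \sum_{I \subseteq I' \subseteq \mathcal{I}} f_{I'} .$$

Proof.

$$\begin{aligned}
\mathit{support}(I, \mathcal{D}) &= |\{(\mathit{tid}, I') \in \mathcal{D} \mid I \subseteq I' \subseteq \mathcal{I}\}| \\
&= \sum_{I \subseteq I' \subseteq \mathcal{I}} |\{(\mathit{tid}, I'') \in \mathcal{D} \mid I'' = I'\}| \\
&= \sum_{I \subseteq I' \subseteq \mathcal{I}} |\mathcal{T}_{I'}(\mathcal{D})| = \sum_{I \subseteq I' \subseteq \mathcal{I}} f_{I'} .
\end{aligned}$$

□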

In Fig. 1, the grey region indicates the set of all transactions that contain
the itemset $AB$. The transactions that contain $AB$ are exactly those of the form
$(\mathit{tid}, \{A, B\})$ and $(\mathit{tid}, \{A, B, C\})$. Hence, the set of transactions in
$\mathcal{D}$ containing $AB$ is $\mathcal{T}_{AB}(\mathcal{D}) \cup \mathcal{T}_{ABC}(\mathcal{D})$.
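Lemma 1 can be checked numerically on the example database (1) with a short Python sketch. The representation of itemsets as `frozenset` objects and the function names are assumptions made for illustration only.

```python
from itertools import combinations

ITEMS = ("A", "B", "C")
# The example database (1): transaction ID -> itemset.
DATABASE = {1: frozenset("AB"), 2: frozenset("AC"), 3: frozenset("BC")}


def all_itemsets(items):
    """Enumerate every subset of the item universe."""
    return [frozenset(s) for r in range(len(items) + 1)
            for s in combinations(items, r)]


def support(itemset, database):
    """Number of transactions that contain all items of the itemset."""
    return sum(1 for t in database.values() if itemset <= t)


def f(itemset, database):
    """f_I: number of transactions that are exactly equal to the itemset."""
    return sum(1 for t in database.values() if t == itemset)


def support_via_lemma(itemset, database, items):
    """Right-hand side of Lemma 1: sum of f over all supersets of the itemset."""
    return sum(f(sup, database) for sup in all_itemsets(items) if itemset <= sup)


lemma_holds = all(
    support(i, DATABASE) == support_via_lemma(i, DATABASE, ITEMS)
    for i in all_itemsets(ITEMS)
)
```

For AB, for instance, the sum runs over the supersets AB and ABC, which mirrors the grey region described for Fig. 1.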

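In the running example, after the second loop, we have the following
information:

$$\begin{aligned}
&\mathit{support}(\{\}, \mathcal{D}) = 3, \quad \mathit{support}(A, \mathcal{D}) = 2, \quad \mathit{support}(B, \mathcal{D}) = 2, \\
&\mathit{support}(C, \mathcal{D}) = 2, \quad \mathit{support}(AB, \mathcal{D}) = 1, \quad \mathit{support}(AC, \mathcal{D}) = 1, \\
&\mathit{support}(BC, \mathcal{D}) = 1
\end{aligned} \tag{2}$$

Therefore, the following equalities hold$^1$:

$$\left\{\begin{aligned}
f_{\{\}} + f_A + f_B + f_C + f_{AB} + f_{AC} + f_{BC} + f_{ABC} &= 3 && (\{\}) \\
f_A + f_{AB} + f_{AC} + f_{ABC} &= 2 && (A) \\
f_B + f_{AB} + f_{BC} + f_{ABC} &= 2 && (B) \\
f_C + f_{AC} + f_{BC} + f_{ABC} &= 2 && (C) \\
f_{AB} + f_{ABC} &= 1 && (AB) \\
f_{AC} + f_{ABC} &= 1 && (AC) \\
f_{BC} + f_{ABC} &= 1 && (BC)
\end{aligned}\right. \tag{3}$$

For example, the equation $f_A + f_{AB} + f_{AC} + f_{ABC} = 2$ expresses that the support
of $A$ equals 2. (3) expresses the same information as (2).

Furthermore, since $f_I = |\mathcal{T}_I(\mathcal{D})|$, it is also true that

$$f_{\{\}}, f_A, f_B, f_C, f_{AC}, f_{BC}, f_{AB}, f_{ABC} \ge 0 \tag{4}$$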



We now show how we can derive from (3) and (4) that $f_{ABC}$ must be 0, and
hence that the support of $ABC$ is 0. Rewriting (3) gives:

$$\left\{\begin{aligned}
f_{\{\}} &= -f_{ABC} & f_{AB} &= 1 - f_{ABC} \\
f_A &= f_{ABC} & f_{AC} &= 1 - f_{ABC} \\
f_B &= f_{ABC} & f_{BC} &= 1 - f_{ABC} \\
f_C &= f_{ABC}
\end{aligned}\right. \tag{5}$$

Since both $f_{\{\}}$ and $f_A$ are greater than or equal to 0 (cf. (4)), the first two lines
of (5) imply respectively $f_{ABC} \le 0$ and $f_{ABC} \ge 0$. Thus, from the information
in (2), it can be derived that $\mathit{support}(ABC, \mathcal{D})$ must be 0, and hence we know a
priori that $ABC$ cannot be frequent. Nevertheless, Apriori does not prune $ABC$.
This example shows that pruning can be improved beyond monotonicity.
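The argument based on (4) and (5) can be replayed mechanically. The following Python sketch, given only as an illustration, substitutes each possible value of $f_{ABC}$ into the rewritten system (5) and keeps the values for which all counts are non-negative; the function names and the search range are assumptions.

```python
def solve_system_5(f_abc):
    """Express every f_I in terms of f_ABC, following system (5)."""
    return {
        "{}": -f_abc,
        "A": f_abc, "B": f_abc, "C": f_abc,
        "AB": 1 - f_abc, "AC": 1 - f_abc, "BC": 1 - f_abc,
        "ABC": f_abc,
    }


def feasible_values(max_value=3):
    """Return the values of f_ABC for which all f_I satisfy constraint (4)."""
    feasible = []
    for f_abc in range(max_value + 1):   # f_ABC itself must be >= 0
        solution = solve_system_5(f_abc)
        if all(value >= 0 for value in solution.values()):
            feasible.append(f_abc)
    return feasible


possible_f_abc = feasible_values()
```

Only the value 0 survives, because any positive $f_{ABC}$ would make $f_{\{\}}$ negative. By Lemma 1 the support of $ABC$ equals $f_{ABC}$, so the support of $ABC$ is 0.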



$^1$ A similar representation is also used in [9,7,8].

3 Definitions

In this section we formalize the notions used in the example of the previous
section. We introduce support expressions to model information about the supports
of the itemsets. The notions of implication and tight implication express what
can be derived from a set of support expressions.

Definition 1. A support expression over $\mathcal{I}$ is an equality

$$\mathit{support}(I) = s ,$$

with $I$ an itemset over $\mathcal{I}$ and $s$ an integer greater than or equal to 0.

A transaction database $\mathcal{D}$ over $\mathcal{I}$ is said to satisfy a support expression
$\mathit{support}(I) = s$ if and only if $\mathit{support}(I, \mathcal{D}) = s$.

A transaction database is said to satisfy a set of support expressions $S$ if and
only if it satisfies every expression in $S$. □

During the execution of the Apriori-algorithm, the support of ever larger item- 
sets is counted. In the theory we develop, the knowledge of the supports of the 
itemsets accumulated in the previous counting steps is modelled as a set of sup- 
port expressions. In the example in Section 2, the knowledge given in (2) is 
expressed by the following set of support expressions: 

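$$\begin{aligned}
\{\; & \mathit{support}(\{\}) = 3, \\
& \mathit{support}(A) = 2,\ \mathit{support}(B) = 2,\ \mathit{support}(C) = 2, \\
& \mathit{support}(AB) = 1,\ \mathit{support}(AC) = 1,\ \mathit{support}(BC) = 1 \;\}
\end{aligned}$$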
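Definition 1 can be made concrete with a few lines of Python. The sketch below is only an illustration: the function names `support` and `satisfies` and the three-transaction database `db` are assumptions introduced here, not part of the paper. It checks that one small database satisfies the seven support expressions of the Section 2 example.

```python
from typing import Dict, FrozenSet, List

Itemset = FrozenSet[str]


def support(itemset: Itemset, database: List[Itemset]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for transaction in database if itemset <= transaction)


def satisfies(database: List[Itemset], expressions: Dict[Itemset, int]) -> bool:
    """True iff the database satisfies every expression support(I) = s."""
    return all(support(i, database) == s for i, s in expressions.items())


# Hypothetical database with three transactions: {A,B}, {A,C}, {B,C}.
db = [frozenset("AB"), frozenset("AC"), frozenset("BC")]

# The support expressions of the Section 2 example, keyed by itemset.
S = {
    frozenset(): 3,
    frozenset("A"): 2, frozenset("B"): 2, frozenset("C"): 2,
    frozenset("AB"): 1, frozenset("AC"): 1, frozenset("BC"): 1,
}

print(satisfies(db, S))                # True
print(support(frozenset("ABC"), db))   # 0
```

Transaction identifiers are left out because supports depend only on how often each item set occurs.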

In the candidate generation and pruning phases, it is decided which sets to 
count in the next iteration. The decision of which sets will be counted is based 
solely on the supports of the itemsets counted so far. For example, a set ABC 
is a candidate in the next loop, only if all three sets AB, AC, and BC were 
found frequent. If, for example, AB is infrequent, then it can be derived that the 
support of ABC is below the support threshold as well. Indeed, from support 
expression support(AB) = s it follows that the support of ABC must be in the 
interval [0, s]. Such deductions are formalized as logical implication in the next 
definition. 

Definition 2. Let $I$ be an itemset over $\mathcal{I}$, and let $l, u \ge 0$ be integers.

A set of support expressions $S$ is said to imply bounds $[l,u]$ on the support
of $I$, denoted $S \models \mathit{support}(I) \in [l,u]$, if in every transaction database $\mathcal{D}$ that
satisfies $S$, $l \le \mathit{support}(I, \mathcal{D}) \le u$ holds.

The bounds $[l,u]$ are said to be tight, denoted $S \models_{tight} \mathit{support}(I) \in [l,u]$, if
there does not exist a smaller interval $[l',u'] \subset [l,u]$ such that $S \models \mathit{support}(I) \in [l',u']$. □

Implication denotes what we can derive from a set of support expressions.
Given a set of support expressions $S$, the deduction of $\mathit{support}(I) \in [l,u]$ is
correct (or sound) if and only if it is true in every database that satisfies $S$.
Tight implication denotes that the bounds cannot be improved; $S \models_{tight} \mathit{support}(I) \in [l,u]$
indicates that, given $S$, both $\mathit{support}(I) = l$ and $\mathit{support}(I) = u$
are possible. Hence, based on $S$, we cannot improve the interval $[l,u]$. Therefore,
a deduction mechanism is complete if, given $S$ and a target set $I$, it always
produces the tight interval for $I$.

Example 1. Let

$$\begin{aligned}
S = \{\; & \mathit{support}(\{\}) = 3, \\
& \mathit{support}(A) = 2,\ \mathit{support}(B) = 2,\ \mathit{support}(C) = 2, \\
& \mathit{support}(AB) = 1,\ \mathit{support}(AC) = 1,\ \mathit{support}(BC) = 1 \;\}
\end{aligned}$$

From the monotonicity rule we know that $S \models \mathit{support}(ABC) \in [0,1]$. The
interval $[0,1]$, however, is not tight for the support of ABC. From the reasoning
in Section 2, we know that in every database that satisfies $S$, the support of
ABC must be 0. Hence, $S \models_{tight} \mathit{support}(ABC) \in [0,0]$. □
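Tight implication can be checked by brute force on an instance this small. The sketch below is an illustration under stated assumptions, not the deduction method of the paper: the helper names are hypothetical, and the search is finite only because the set of expressions fixes support({}), the number of transactions. It enumerates every database of that size and collects the values that the support of the target itemset can take.

```python
from itertools import combinations, combinations_with_replacement


def powerset(items):
    """All subsets of `items` as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def support(itemset, database):
    return sum(1 for transaction in database if itemset <= transaction)


def brute_force_bounds(target, expressions, items):
    """Tight (l, u) for support(target), or None if nothing satisfies S.

    Assumes `expressions` contains an entry for the empty itemset.
    """
    size = expressions[frozenset()]
    values = [
        support(target, db)
        for db in combinations_with_replacement(powerset(items), size)
        if all(support(i, db) == s for i, s in expressions.items())
    ]
    return (min(values), max(values)) if values else None


S = {
    frozenset(): 3,
    frozenset("A"): 2, frozenset("B"): 2, frozenset("C"): 2,
    frozenset("AB"): 1, frozenset("AC"): 1, frozenset("BC"): 1,
}
pairs_only = {k: v for k, v in S.items() if len(k) != 1}

print(brute_force_bounds(frozenset("ABC"), S, "ABC"))           # (0, 0)
print(brute_force_bounds(frozenset("ABC"), pairs_only, "ABC"))  # (0, 1)
```

Dropping the three singleton expressions widens the tight interval to [0, 1], which shows that the supports of the singletons are what force the support of ABC down to 0.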

The next lemma makes a similar connection between support expressions and 
systems of linear inequalities as in the example in Section 2. 

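Lemma 2. Let $S$ be a collection of support expressions over $\mathcal{I}$. There exists a
transaction database $\mathcal{D}$ over $\mathcal{I}$ that satisfies $S$, if and only if the following system
of inequalities has an integer solution in the variables $x_I$, $I \subseteq \mathcal{I}$:

$$Sys(S) =_{def} \begin{cases}
x_I \ge 0 & \forall I \subseteq \mathcal{I} \\
\sum_{I \subseteq I' \subseteq \mathcal{I}} x_{I'} = s_I & \forall (\mathit{support}(I) = s_I) \in S
\end{cases}$$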

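Proof. If: Consider an integer solution of $Sys(S)$. In such a solution all $x_I$'s are
integers greater than or equal to 0. Let now $\mathcal{D}$ be the transaction database that
for all $I \subseteq \mathcal{I}$ contains $x_I$ transactions of the form $(tid, I)$. Hence, for all $I \subseteq \mathcal{I}$,
$f_I(\mathcal{D}) = x_I$. Using Lemma 1, we obtain:

$$\forall I \subseteq \mathcal{I} : \mathit{support}(I, \mathcal{D}) = \sum_{I \subseteq I' \subseteq \mathcal{I}} f_{I'}(\mathcal{D}) = \sum_{I \subseteq I' \subseteq \mathcal{I}} x_{I'} \qquad (6)$$

For all support expressions $\mathit{support}(I) = s_I$ in $S$, $Sys(S)$ contains the equality
$\sum_{I \subseteq I' \subseteq \mathcal{I}} x_{I'} = s_I$; hence, via (6), $\mathit{support}(I, \mathcal{D}) = s_I$. Thus, $\mathcal{D}$ satisfies $S$.

Only if: Let $\mathcal{D}$ be a transaction database that satisfies $S$. Then $x_I = f_I(\mathcal{D})$, for
all $I \subseteq \mathcal{I}$, is an integer solution of the system $Sys(S)$. Indeed, for all $I$, $f_I(\mathcal{D})$
is greater than or equal to 0. Furthermore, since $\mathcal{D}$ satisfies $S$, for all support
expressions $\mathit{support}(I) = s_I$ in $S$, $\mathit{support}(I, \mathcal{D}) = s_I$. Because of Lemma 1,
$\mathit{support}(I, \mathcal{D}) = \sum_{I \subseteq I' \subseteq \mathcal{I}} f_{I'}(\mathcal{D})$, and hence $\sum_{I \subseteq I' \subseteq \mathcal{I}} f_{I'}(\mathcal{D}) = s_I$. □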
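The "if" direction of the proof is constructive, and the construction is short enough to write down. In the sketch below the function names and the sample solution `x` are assumptions made for illustration. It builds a database containing x_I transactions with item set exactly I and then recovers both the exact counts f_I and the supports.

```python
from collections import Counter


def database_from_solution(x):
    """Database with x[I] transactions (tid, I) for every itemset I."""
    database = []
    for itemset, count in x.items():
        for _ in range(count):
            database.append((len(database) + 1, itemset))
    return database


def support(itemset, database):
    """Number of transactions whose item set contains `itemset`."""
    return sum(1 for _tid, items in database if itemset <= items)


def exact_counts(database):
    """f_I(D): number of transactions whose item set is exactly I."""
    return Counter(items for _tid, items in database)


# Hypothetical non-negative integer solution over the items A, B, C.
x = {frozenset("AB"): 1, frozenset("AC"): 1,
     frozenset("BC"): 1, frozenset("ABC"): 0}

db = database_from_solution(x)
print(len(db))                                          # 3
print(support(frozenset("A"), db))                      # 2
print(support(frozenset("AB"), db))                     # 1
print(exact_counts(db)[frozenset("AB")])                # 1
```

Each support equals the sum of the x-values of the supersets, which is equation (6).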

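Example 2. There exists a transaction database $\mathcal{D}$ with $\mathit{support}(\{\}, \mathcal{D}) = 3$,
$\mathit{support}(A, \mathcal{D}) = 2$, $\mathit{support}(B, \mathcal{D}) = 2$, and $\mathit{support}(AB, \mathcal{D}) = 0$ if and only if
the following system of inequalities has a solution:

$$\begin{cases}
x_{\{\}}, x_A, x_B, x_{AB} \ge 0 \\
x_{\{\}} + x_A + x_B + x_{AB} = 3 \\
x_A + x_{AB} = 2 \\
x_B + x_{AB} = 2 \\
x_{AB} = 0
\end{cases}$$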




220 



T. Calders 



From the last three equalities we derive that $x_A = x_B = 2$ and $x_{AB} = 0$.
This, however, conflicts with $x_{\{\}} + x_A + x_B + x_{AB} = 3$: substituting gives
$x_{\{\}} = -1$, while all variables must be greater than or equal to 0. Hence, we conclude
that there does not exist a transaction database satisfying the given support expressions. □

Problem Statement. In the remainder we will concentrate on implication problems
for a set $I$, based on a set $S$ of support expressions that contains exactly
one expression for each strict subset of $I$. We do not consider cases in which $S$
contains support expressions for supersets of $I$, or in which subsets are missing.
Hence, given an integer $s_J$ for all $J \subset I$, tight implication of the following type
is studied:

$$\{\mathit{support}(J) = s_J \mid J \subset I\} \models_{tight} \mathit{support}(I) \in [l,u] .$$

Notice that the information $\{\mathit{support}(J) = s_J \mid J \subset I\}$ is available for every
candidate itemset $I$ in the Apriori-algorithm.

4 Deduction Rules 

In this section we describe sound and complete rules for deducing tight bounds
on the support of a set $I$ if the supports of all its subsets are given. Because
we do not consider itemsets that are not subsets of $I$, we can assume that all
items in the database are elements of $I$. Since "projecting away" the other items
in a transaction database does not change the supports of subsets of $I$, we can
assume without loss of generality that $\mathcal{I} = I$. The correctness of this observation
follows from the next lemma.

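Definition 3. Let $I \subseteq \mathcal{I}$ be an itemset.

- The projection of a transaction $(tid, I')$ over $\mathcal{I}$ on $I$, denoted $\pi_I(tid, I')$, is
the transaction $(tid, I' \cap I)$.

- The projection of a transaction database $\mathcal{D}$ over $\mathcal{I}$ on $I$, denoted $\pi_I \mathcal{D}$, is
defined as $\pi_I \mathcal{D} =_{def} \{\pi_I T \mid T \in \mathcal{D}\}$.

Lemma 3. Let $\mathcal{I}$ be a set of items, and let $J \subseteq I$ be itemsets. For every
transaction database $\mathcal{D}$ over $\mathcal{I}$ it holds that

$$\mathit{support}(J, \mathcal{D}) = \mathit{support}(J, \pi_I \mathcal{D}) .$$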



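Proof. For $J \subseteq I$,

$$\begin{aligned}
\mathit{support}(J, \mathcal{D}) &= |\{(tid, I') \in \mathcal{D} \mid J \subseteq I'\}| \\
&= |\{(tid, I') \in \mathcal{D} \mid J \subseteq (I' \cap I)\}| \qquad (J \subseteq I) \\
&= |\{(tid, I'') \in \pi_I \mathcal{D} \mid J \subseteq I''\}| \\
&= \mathit{support}(J, \pi_I \mathcal{D})
\end{aligned}$$

□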
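Lemma 3 is easy to observe on data. The sketch below uses a small hypothetical database over the items A to E (the database and the helper names are assumptions made for illustration), projects it on I = ABC, and compares the supports of all subsets of I before and after projection.

```python
from itertools import combinations


def support(itemset, database):
    return sum(1 for _tid, items in database if itemset <= items)


def project(database, I):
    """pi_I D: keep each tid and intersect the transaction's items with I."""
    return [(tid, items & I) for tid, items in database]


def subsets(itemset):
    items = sorted(itemset)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


# Hypothetical database over the items A..E.
db = [(1, frozenset("ABD")), (2, frozenset("ACE")),
      (3, frozenset("BCE")), (4, frozenset("ABCDE"))]
I = frozenset("ABC")

projected = project(db, I)
print(all(support(J, db) == support(J, projected) for J in subsets(I)))  # True
```

Keeping the transaction identifiers matters: two transactions that become equal after projection must still be counted twice.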



This lemma allows for an important reduction of the system $Sys(S)$ associated
with a set of support expressions $S$ that contains an expression for every strict
subset of $I$. Instead of having a variable $x_J$ for every itemset $J \subseteq \mathcal{I}$, with
Lemma 3 we can restrict the variables to only those $x_J$ such that $J \subseteq I$.

Corollary 1. Given an itemset $I \subseteq \mathcal{I}$, and integers $s_J \ge 0$, for every $J \subseteq I$.
There exists a transaction database $\mathcal{D}$ satisfying $\forall J \subseteq I : \mathit{support}(J, \mathcal{D}) = s_J$ if
and only if the following system of inequalities has a solution:

$$\begin{cases}
x_J \ge 0 & \forall J \subseteq I \\
\sum_{J \subseteq J' \subseteq I} x_{J'} = s_J & \forall J \subseteq I
\end{cases}$$



Proof. Because of Lemma 3, the existence of a database $\mathcal{D}$ over $\mathcal{I}$ satisfying the
given expressions implies the existence of such a database over $I$, namely $\pi_I \mathcal{D}$.
The corollary now follows from Lemma 2. □

Let $I \subseteq \mathcal{I}$ be an itemset. We assume that all supports of the strict subsets $J$
of $I$ are known; let $s_J$ denote $\mathit{support}(J, \mathcal{D})$. We will now derive optimal bounds
on the support of $I$. These bounds can be determined as follows: the best possible
lower bound is the smallest integer $l$ such that the system of support expressions

$$\{\mathit{support}(J) = s_J \mid J \subset I\} \cup \{\mathit{support}(I) = l\}$$

is satisfiable. The best upper bound is the largest integer $u$ such that

$$\{\mathit{support}(J) = s_J \mid J \subset I\} \cup \{\mathit{support}(I) = u\}$$

is satisfiable. Let now $s_I$ be an arbitrary integer. From Corollary 1, we know
that the system of support constraints

$$\{\mathit{support}(J) = s_J \mid J \subset I\} \cup \{\mathit{support}(I) = s_I\}$$

is satisfiable if and only if the following system of inequalities has an integer
solution:

$$\begin{cases}
x_J \ge 0 & \forall J \subseteq I \\
\sum_{J \subseteq J' \subseteq I} x_{J'} = s_J & \forall J \subseteq I
\end{cases}$$

This system can be solved for the $x_J$'s as follows:

$$\begin{cases}
s_I = x_I \\
s_{I-A} = x_I + x_{I-A} \\
s_{I-B} = x_I + x_{I-B} \\
s_{I-AB} = x_I + x_{I-A} + x_{I-B} + x_{I-AB}
\end{cases}
\iff
\begin{cases}
x_I = s_I \\
x_{I-A} = s_{I-A} - s_I \\
x_{I-B} = s_{I-B} - s_I \\
s_{I-AB} = x_I + x_{I-A} + x_{I-B} + x_{I-AB}
\end{cases}
\iff
\begin{cases}
x_I = s_I \\
x_{I-A} = s_{I-A} - s_I \\
x_{I-B} = s_{I-B} - s_I \\
x_{I-AB} = s_{I-AB} - s_{I-A} - s_{I-B} + s_I
\end{cases}$$

In general, $x_J = \sum_{J \subseteq J' \subseteq I} (-1)^{|J' - J|} s_{J'}$, as the following lemma shows.

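Lemma 4. Let $I$ be an itemset, and for all $J \subseteq I$, let $s_J, x_J$ be integers. The
following are equivalent:

(1) $\forall J \subseteq I : s_J = \sum_{J \subseteq J' \subseteq I} x_{J'}$

(2) $\forall J \subseteq I : x_J = \sum_{J \subseteq J' \subseteq I} (-1)^{|J' - J|} s_{J'}$

(The proof of this lemma can be found in Appendix A.)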
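The two directions of Lemma 4 are inverse transformations and can be checked numerically. In the sketch below the function names are assumptions introduced for illustration. It applies direction (2) to the supports of Example 2 and then applies direction (1) to return to the original supports.

```python
from itertools import combinations


def subsets(itemset):
    items = sorted(itemset)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def counts_from_supports(s, I):
    """(2): x_J = sum over J <= J' <= I of (-1)^|J' - J| * s_J'."""
    return {J: sum((-1) ** len(Jp - J) * s[Jp]
                   for Jp in subsets(I) if J <= Jp)
            for J in subsets(I)}


def supports_from_counts(x, I):
    """(1): s_J = sum over J <= J' <= I of x_J'."""
    return {J: sum(x[Jp] for Jp in subsets(I) if J <= Jp)
            for J in subsets(I)}


I = frozenset("AB")
s = {frozenset(): 3, frozenset("A"): 2, frozenset("B"): 2, frozenset("AB"): 0}

x = counts_from_supports(s, I)
print(x[frozenset()])                      # -1
print(supports_from_counts(x, I) == s)     # True
```

The negative value for the empty set is the contradiction found in Example 2: the supports are consistent with the equalities, but not with the requirement that every x_J be non-negative.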

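Therefore, the system of support constraints

$$\{\mathit{support}(J) = s_J \mid J \subset I\} \cup \{\mathit{support}(I) = s_I\}$$

is satisfiable if and only if the following system of inequalities has an integer
solution:

$$\begin{cases}
x_J \ge 0 & \forall J \subseteq I \\
x_J = \sum_{J \subseteq J' \subseteq I} (-1)^{|J' - J|} s_{J'} & \forall J \subseteq I
\end{cases}$$

Hence, if and only if

$$\sum_{J \subseteq J' \subseteq I} (-1)^{|J' - J|} s_{J'} \ge 0 \qquad \forall J \subseteq I$$

or, equivalently,

$$\begin{aligned}
s_I &\le \sum_{J \subseteq J' \subset I} (-1)^{|I - J'| + 1} s_{J'} \qquad \forall J \subseteq I,\ |I - J| \text{ odd} \\
s_I &\ge \sum_{J \subseteq J' \subset I} (-1)^{|I - J'| + 1} s_{J'} \qquad \forall J \subseteq I,\ |I - J| \text{ even}
\end{aligned}$$

Let $\sigma_I(J, \mathcal{D})$ denote the sum

$$\sigma_I(J, \mathcal{D}) =_{def} \sum_{J \subseteq J' \subset I} (-1)^{|I - J'| + 1} \mathit{support}(J', \mathcal{D})$$

and let $\mathcal{R}_I(J, \mathcal{D})$ denote the rule $\mathit{support}(I) \le \sigma_I(J, \mathcal{D})$ if $|I - J|$ is odd, and
$\mathit{support}(I) \ge \sigma_I(J, \mathcal{D})$ if $|I - J|$ is even. We obtain the following theorem, which
states that the bounds for itemset $I$ found by the rules $\mathcal{R}_I(J)$, for all $J \subseteq I$, are
the best bounds possible; that is, the interval found is tight.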

Theorem 1. Let $\mathcal{D}$ be a transaction database, and let $I$ be an itemset. $s_J$ denotes
$\mathit{support}(J, \mathcal{D})$.

$$\{\mathit{support}(J) = s_J \mid J \subset I\} \models_{tight} \mathit{support}(I) \in [l,u]$$

with

$$\begin{aligned}
l &= \max\{\sigma_I(J, \mathcal{D}) \mid J \subseteq I,\ |I - J| \text{ even}\} \\
u &= \min\{\sigma_I(J, \mathcal{D}) \mid J \subseteq I,\ |I - J| \text{ odd}\}
\end{aligned}$$

Hence, the rules $\mathcal{R}_I(J)$, $J \subseteq I$, are sound and complete for implication of the
support of $I$, based on the supports of the strict subsets of $I$.

(The proof of this theorem can be found in Appendix A.)
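The bounds of Theorem 1 translate directly into code. The following is a minimal sketch with hypothetical function names; it assumes that I is non-empty and that the dictionary `s` holds the support of every strict subset of I. It evaluates all 2^|I| rules, so it is only practical for small itemsets.

```python
from itertools import combinations


def subsets(itemset):
    items = sorted(itemset)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def sigma(J, I, s):
    """sigma_I(J): alternating sum over all J' with J <= J' and J' a strict subset of I."""
    return sum((-1) ** (len(I - Jp) + 1) * s[Jp]
               for Jp in subsets(I) if J <= Jp and Jp != I)


def tight_bounds(I, s):
    """(l, u) of Theorem 1 for a non-empty itemset I."""
    lower = max(sigma(J, I, s) for J in subsets(I) if len(I - J) % 2 == 0)
    upper = min(sigma(J, I, s) for J in subsets(I) if len(I - J) % 2 == 1)
    return lower, upper


# Supports of Example 1.
s = {
    frozenset(): 3,
    frozenset("A"): 2, frozenset("B"): 2, frozenset("C"): 2,
    frozenset("AB"): 1, frozenset("AC"): 1, frozenset("BC"): 1,
}
print(tight_bounds(frozenset("ABC"), s))   # (0, 0)

# Supports of Example 2 without the expression for AB.
print(tight_bounds(frozenset("AB"), s))    # (1, 2)
```

The second call explains Example 2 from another angle: with support({}) = 3 and support(A) = support(B) = 2, the support of AB must lie in [1, 2], so the value 0 is impossible.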

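The rules $\mathcal{R}_{ABCD}(J)$ have been given in Figure 2.

$$\begin{aligned}
\mathit{support}(ABCD) &\ge s_{ABC} + s_{ABD} + s_{ACD} + s_{BCD} - s_{AB} - s_{AC} - s_{AD} - s_{BC} - s_{BD} - s_{CD} \\
&\qquad + s_A + s_B + s_C + s_D - s_{\{\}} \\
\mathit{support}(ABCD) &\le s_A - s_{AB} - s_{AC} - s_{AD} + s_{ABC} + s_{ABD} + s_{ACD} \\
\mathit{support}(ABCD) &\le s_B - s_{AB} - s_{BC} - s_{BD} + s_{ABC} + s_{ABD} + s_{BCD} \\
\mathit{support}(ABCD) &\le s_C - s_{AC} - s_{BC} - s_{CD} + s_{ABC} + s_{ACD} + s_{BCD} \\
\mathit{support}(ABCD) &\le s_D - s_{AD} - s_{BD} - s_{CD} + s_{ABD} + s_{ACD} + s_{BCD} \\
\mathit{support}(ABCD) &\ge s_{ABC} + s_{ABD} - s_{AB} \\
\mathit{support}(ABCD) &\ge s_{ABC} + s_{ACD} - s_{AC} \\
\mathit{support}(ABCD) &\ge s_{ABD} + s_{ACD} - s_{AD} \\
\mathit{support}(ABCD) &\ge s_{ABC} + s_{BCD} - s_{BC} \\
\mathit{support}(ABCD) &\ge s_{ABD} + s_{BCD} - s_{BD} \\
\mathit{support}(ABCD) &\ge s_{ACD} + s_{BCD} - s_{CD} \\
\mathit{support}(ABCD) &\le s_{ABC} \\
\mathit{support}(ABCD) &\le s_{ABD} \\
\mathit{support}(ABCD) &\le s_{ACD} \\
\mathit{support}(ABCD) &\le s_{BCD} \\
\mathit{support}(ABCD) &\ge 0
\end{aligned}$$

Fig. 2. Tight bounds on $\mathit{support}(ABCD)$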



Example 3. Consider the following transaction database.

    TID   items
    1     A,B
    2     A,C,D
    3     A,B,D
    4     C,D
    5     B,C,D
    6     A,D
    7     B,D
    8     B,C,D
    9     B,C,D
    10    A,B,C,D

The supports of the strict subsets of ABCD are:

    s_{}  = 10,   s_A   = 5,   s_B   = 7
    s_C   = 6,    s_D   = 9,   s_AB  = 3
    s_AC  = 2,    s_AD  = 4,   s_BC  = 4
    s_BD  = 6,    s_CD  = 6,   s_ABC = 1
    s_ABD = 2,    s_ACD = 2,   s_BCD = 4

Figure 2 gives the rules to determine tight bounds on the support of ABCD.
Based on these deduction rules we derive the following bounds on the support
of ABCD without counting in the database $\mathcal{D}$.

$\mathit{support}(ABCD, \mathcal{D}) \ge 1$   (Rule $\mathit{support}(ABCD) \ge s_{ABC} + s_{ACD} - s_{AC}$)
$\mathit{support}(ABCD, \mathcal{D}) \le 1$   (Rule $\mathit{support}(ABCD) \le s_{ABC}$)

Therefore, we can conclude, without having to count, that the support of
ABCD in $\mathcal{D}$ must be 1. In the experiments we will see that this exactness is not
very unusual; even in real-life data, and for small itemsets, we will be able to
derive very narrow intervals. □
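Example 3 can be replayed mechanically. The sketch below counts the supports of the strict subsets of ABCD in the ten transactions, evaluates all sixteen rules of Figure 2, and lists the rules that attain the bounds. The variable names and the naming of a rule by its set J are choices made here for illustration.

```python
from itertools import combinations

# Transaction database of Example 3 (TIDs 1..10).
database = [frozenset(t) for t in
            ["AB", "ACD", "ABD", "CD", "BCD", "AD", "BD", "BCD", "BCD", "ABCD"]]
I = frozenset("ABCD")


def subsets(itemset):
    items = sorted(itemset)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


# Supports of all strict subsets of I, counted directly in the database.
s = {J: sum(1 for t in database if J <= t) for J in subsets(I) if J != I}

lower_rules, upper_rules = {}, {}
for J in subsets(I):
    value = sum((-1) ** (len(I - Jp) + 1) * s[Jp]
                for Jp in subsets(I) if J <= Jp and Jp != I)
    name = "".join(sorted(J)) or "{}"
    (lower_rules if len(I - J) % 2 == 0 else upper_rules)[name] = value

l, u = max(lower_rules.values()), min(upper_rules.values())
print("bounds:", (l, u))   # bounds: (1, 1)
print("binding lower rules:", sorted(n for n, v in lower_rules.items() if v == l))
print("binding upper rules:", sorted(n for n, v in upper_rules.items() if v == u))
```

The rule for J = AC is among the binding lower rules and the rule for J = ABC is among the binding upper rules; these are the two rules quoted in the example.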



5 Concise Representation 

5.1 Derivable Itemsets 

Definition 4. Let $I$ be an itemset, and $\mathcal{D}$ a transaction database. Let for all
itemsets $J \subseteq I$, $s_J$ denote $\mathit{support}(J, \mathcal{D})$. We say that $I$ is a derivable itemset
w.r.t. $\mathcal{D}$, if

$$\{\mathit{support}(J) = s_J \mid J \subset I\} \models_{tight} \mathit{support}(I) \in [s_I, s_I]$$

□

Notice that $I$ derivable means that we do not have to count $I$ in the database
to know its support. Based on the supports of the subsets of $I$ we can derive the
support of $I$ exactly.

5.2 NDI-Representation 

Based on the notion of derivable itemsets we propose a concise representation.
A concise representation [15] is a subset of the set of frequent itemsets, extended
with supports, that allows for deriving the whole set of frequent sets and their
supports. Such a concise representation is typically much smaller than the whole
set of frequent itemsets, even though it contains the same amount of information.
Therefore, in situations where the number of frequent itemsets is very large, it
is often better to only mine a concise representation. Let $[l_I, u_I]$ be the bounds
we can derive for itemset $I$, based on the supports of its subsets. We now define
the NDI-representation as follows:

$$NDI(\mathcal{D}, s) =_{def} \{(I, \mathit{support}(I, \mathcal{D})) \mid l_I \ne u_I,\ \mathit{support}(I, \mathcal{D}) \ge s\} .$$

That is, NDI only contains those sets that are both frequent in $\mathcal{D}$ and not
derivable w.r.t. $\mathcal{D}$.

Theorem 2. Let $\mathcal{D}$ be a transaction database, and let $s$ be a support threshold.
$NDI(\mathcal{D}, s)$ is a concise representation for the $s$-frequent itemsets in $\mathcal{D}$.
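One way to see why the NDI-representation loses no information is to write the reconstruction procedure down. The sketch below is an interpretation, not the proof from the paper: the function names are hypothetical, the `ndi` dictionary is the representation of the database of Example 3 at support threshold 1 (worked out by hand), and the code assumes that every subset of an itemset in NDI is itself in NDI. It returns the support of an itemset if the itemset is frequent and `None` otherwise.

```python
from itertools import combinations


def subsets(itemset):
    items = sorted(itemset)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def bounds(I, s):
    """Tight (l, u) for non-empty I from the supports s of its strict subsets."""
    lower, upper = [], []
    for J in subsets(I):
        value = sum((-1) ** (len(I - Jp) + 1) * s[Jp]
                    for Jp in subsets(I) if J <= Jp and Jp != I)
        (lower if len(I - J) % 2 == 0 else upper).append(value)
    return max(lower), min(upper)


def derive_support(I, known, min_support):
    """Support of I if I is frequent, else None; `known` is updated in place."""
    if I in known:
        return known[I]
    # All strict subsets must be frequent before I can be frequent.
    if any(derive_support(I - {item}, known, min_support) is None for item in I):
        known[I] = None
        return None
    l, u = bounds(I, known)
    # l != u: I is non-derivable but absent from NDI, hence infrequent.
    known[I] = l if (l == u and l >= min_support) else None
    return known[I]


# NDI of the Example 3 database at threshold 1, plus the database size.
ndi = {frozenset(k): v for k, v in {
    "A": 5, "B": 7, "C": 6, "D": 9, "AB": 3, "AC": 2, "AD": 4,
    "BC": 4, "BD": 6, "CD": 6, "ABC": 1}.items()}
known = {frozenset(): 10, **ndi}

print(derive_support(frozenset("ABCD"), known, 1))   # 1
print(derive_support(frozenset("BCD"), known, 1))    # 4
print(derive_support(frozenset("AE"), known, 1))     # None
```

The three triples ABD, ACD and BCD are not stored in the representation; their supports 2, 2 and 4 are recovered from the bounds, and only then can ABCD be derived.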

5.3 Algorithm 

In [8], the following theorem has been proven:

Theorem 3 (Anti-monotonicity of Derivability). Let $I \subseteq J$ be itemsets
over $\mathcal{I}$, and $\mathcal{D}$ be a transaction database over $\mathcal{I}$. If $I$ is a derivable itemset, then
$J$ must be a derivable itemset as well.

Based on this theorem we come up with the following algorithm to find all
frequent non-derivable itemsets.
    NDI(D, s)
    i := 1; NDI := {}; C_1 := {{i} | i ∈ I};
    for all I in C_1 do I.l := 0; I.u := |D|;
    while C_i not empty do
        Count the supports of all candidates in C_i in one pass over D;
        F_i := {I ∈ C_i | support(I, D) ≥ s};
        NDI := NDI ∪ F_i;
        PreC_{i+1} := AprioriGenerate(F_i);
        C_{i+1} := {};
        for all J ∈ PreC_{i+1} do
            Compute bounds [l, u] on support of J;
            if l ≠ u then J.l := l; J.u := u; C_{i+1} := C_{i+1} ∪ {J};
        i := i + 1
    end while
    return NDI

For a more elaborate description of the algorithm we refer the interested
reader to [8].
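The pseudocode above can be turned into a small runnable sketch. The version below is an illustration, not the implementation used in the experiments: it keeps the database in memory, counts supports with a plain scan, uses a quadratic candidate join in the hypothetical helper `apriori_generate`, and evaluates every rule for every candidate.

```python
from itertools import combinations


def subsets(itemset):
    items = sorted(itemset)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def bounds(J, s):
    """Tight (l, u) for non-empty J from the supports s of its strict subsets."""
    lower, upper = [], []
    for X in subsets(J):
        value = sum((-1) ** (len(J - Y) + 1) * s[Y]
                    for Y in subsets(J) if X <= Y and Y != J)
        (lower if len(J - X) % 2 == 0 else upper).append(value)
    return max(lower), min(upper)


def apriori_generate(frequent):
    """Sets one item larger whose subsets of the previous size are all in `frequent`."""
    result = set()
    for a in frequent:
        for b in frequent:
            union = a | b
            if len(union) == len(a) + 1 and all(union - {i} in frequent for i in union):
                result.add(union)
    return result


def ndi(database, min_support):
    """Frequent non-derivable itemsets of `database`, with their supports."""
    s = {frozenset(): len(database)}          # supports known so far
    result = {}
    candidates = {frozenset([i]) for t in database for i in t}
    while candidates:
        counted = {c: sum(1 for t in database if c <= t) for c in candidates}
        frequent = {c for c, n in counted.items() if n >= min_support}
        for c in frequent:
            s[c] = result[c] = counted[c]
        candidates = set()
        for J in apriori_generate(frequent):
            l, u = bounds(J, s)
            if l != u:                          # derivable sets are never counted
                candidates.add(J)
    return result


database = [frozenset(t) for t in
            ["AB", "ACD", "ABD", "CD", "BCD", "AD", "BD", "BCD", "BCD", "ABCD"]]
rep = ndi(database, 1)
print(len(rep))                       # 11
print(frozenset("ABD") in rep)        # False
```

On the database of Example 3 with threshold 1, the representation holds the four singletons, the six pairs and ABC. The triples ABD, ACD and BCD are derivable, so they are pruned before counting and ABCD is never generated.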



6 Experiments 

The experiments were performed on a 1.5GHz Pentium IV PC with 256MB 
of main memory. To empirically evaluate the proposed NDI-algorithm and de- 
duction rules, we performed several tests on the datasets summarized in the 
following table. For each dataset the table shows the number of transactions, 
the number of items, and the average transaction length. 



    Dataset          # trans.   # items   Avg. length
    BMS-POS          515 597    1 656     6.53
    T40I10D100K      100 000    1 000     39.6
    Connect-4        67 557     125       42
    BMS-Webview-1    59 602     497       2.51
    Pumsb            49 046     2 112     74
    Mushroom         8 124      120       23



These datasets are all well-known benchmarks for frequent itemset mining.
The BMS-Webview and BMS-POS datasets are click-stream data from a small
dot-com company that no longer exists. These two datasets were donated to the
research community by Blue Martini Software. The Pumsb dataset is based on
census data; the Mushroom dataset contains characteristics of different species
of mushrooms. The Connect-4 dataset contains different game positions. The
Pumsb dataset is available in the UCI KDD Repository [13], and the Mushroom
and Connect-4 datasets can be found in the UCI Machine Learning Repository [3].
The T40I10D100K dataset was generated using the IBM synthetic data
generator.

The NDI-algorithm differs slightly from the algorithm presented in Section 5.
In order to avoid the generation of pairs of items in the second loop, the
candidates are only generated while iterating over the dataset. In this way the
generation of pairs that do not appear in the database is avoided.

6.1 Overhead of Rule Evaluation 

We first study the influence of limiting the depth of the rules we evaluate. In
Fig. 3, the number of sets that are derivable when we evaluate rules up to depth
k, and the time needed to find them, are indicated for different k. As can be seen,
the number of NDIs drops quickly from depth 1 to depth 2. In the Mushroom
experiment, the test with k = 1 was not even feasible. From depth 3 on, higher
depths result in only a slight decrease of the number of NDIs. This is not that
remarkable, since the number of NDIs of these sizes is small. The running times
in Fig. 3 show that for these limited depths, the cost of evaluating all rules is
rather small.





[Figure: (a) Pumsb, (b) Mushroom]

Fig. 3. Size of the representations when k is limited



In Tab. 1, four experiments with the BMS-POS dataset are reported in detail.
For each iteration of the algorithm, the number of candidates and the number
of these candidates that turn out to be frequent NDIs are reported, together with
the computation time of respectively candidate generation with rule evaluation
and counting. In Tab. 1 (a) and (b), all rules are evaluated, while in Tab. 1
(c) and (d) no rules are evaluated. Hence, the experiments in Tab. 1 (c) and
(d) are in fact plain Apriori. The BMS-POS dataset is interesting, since it is
the only dataset that contains almost no derivable itemsets. Therefore, these
examples show very well the cost of evaluating the rules. The rule evaluation
time is included in the generation time for the candidates. Tab. 1 shows that
the evaluation times for the rules are very reasonable. This is especially so when
the number of transactions becomes very high. For iteration 2, the number of
candidates and the generation time of these candidates are not reported, because
they are generated on the fly to avoid pairs of items with support 0.
Table 1. Example runs on the BMS-POS dataset

(a) BMS-POS, support = 361, all rules

    k      |C_k|     gen. time   |NDI|     count time   Width
    1      1656      6.437s      510       0s           515597
    2                            9553      35.828s      11610
    3      187839    6.625s      39912     80.859s      29167
    4      157481    14.329s     74768     120.828s     11831
    5      117797    28.437s     77353     126.922s     3462
    6      62233     41.922s     47499     93.266s      998
    7      18369     38.656s     16276     46.297s      250
    8      2230      18s         2141      16.547s      88
    9      10        1.906s      9         8.579s       19
    total                        268021    686.9s

(b) BMS-POS, support = 465, all rules

    k      |C_k|     gen. time   |NDI|     count time   Width
    1      1656      6.609s      461       36.047s      515597
    2                            7554      75.437s      116102
    3      126338    4.719s      27904     103.188s     29167
    4      95701     8.922s      46115     98.25s       11831
    5      63578     15.5s       42047     67.641s      3462
    6      29226     20.578s     22300     29.937s      998
    7      7075      14.86s      6315      12.031s      250
    8      704       5.234s      685       7.985s       88
    9      1         0.343s      1         8.579s       9
    total                        153382    508.234s

(c) BMS-POS, support = 361, no rules

    k      |C_k|     gen. time   |NDI|     count time   Width
    1      1656      5.531s      510       0s           n/a
    2                            9553      35.719s      n/a
    3      187839    5.031s      39912     80.672s      n/a
    4      157481    7.375s      74768     126s         n/a
    5      117929    8.984s      77361     128.484s     n/a
    6      63981     7.406s      47741     95.422s      n/a
    7      21335     3.812s      17293     49.203s      n/a
    8      3765      1.109s      3283      19.797s      n/a
    9      255       0.282s      228       9.625s       n/a
    total                        270649    594.141s

(d) BMS-POS, support = 465, no rules

    k      |C_k|     gen. time   |NDI|     count time   Width
    1      1656      5.5s        461       0s           n/a
    2                            7554      34.86s       n/a
    3      126338    3.344s      27904     73.547s      n/a
    4      95701     4.359s      46115     102.469s     n/a
    5      63641     4.656s      42048     95.297s      n/a
    6      30114     3.421s      22341     63.532s      n/a
    7      7967      1.422s      6480      29.953s      n/a
    8      962       0.359s      849       13.079s      n/a
    9      30        0.188s      29        8.312s       n/a
    total                        153781    453.093s

6.2 Comparison with Mining All Frequent Sets 

Since the overhead of calculating the rules is small, the running times of the
Apriori-algorithm and the NDI-algorithm are almost linear in the size of their
respective output. Therefore, the gain in speed of the NDI-algorithm over the
extraction of all frequent itemsets with the Apriori-algorithm is more or less the
ratio between the number of frequent sets and the number of frequent NDIs. This
claim is supported by Fig. 4. In Fig. 4, the running time of the NDI-algorithm,
Apriori, and FPGrowth is given, together with the number of the frequent and
the non-derivable sets, for different minimal supports. In this example, the
execution time of FPGrowth is much lower than for the other algorithms. As long
as the number of frequent sets is not too high, FPGrowth will be more efficient
than mining the NDIs. As soon as the number of frequent sets becomes very
high, however, NDI will become more efficient.

[Figure; horizontal axis: support]

Fig. 4. Running time on the Mushroom dataset

Since in most of the experiments we present the number of frequent sets
is very high, the execution of the Apriori-algorithm was not always possible.
Instead we present a comparison with FPGrowth, and we report for some of the
experiments the number of frequent sets and the number of NDIs. The results
are presented in Fig. 5. For the Connect-4 and the Pumsb dataset it was not
possible to run the FPGrowth algorithm within reasonable time for the
lowest supports.

In Fig. 5, it can clearly be seen that in most datasets once the support 
threshold becomes too low, and the number of frequent sets explodes, the NDI- 
algorithm becomes more efficient than mining all frequent itemsets. In Fig. 6, 
the number of frequent NDIs is compared with the total number of frequent sets. 
The only exceptional dataset in this perspective was the BMS-POS dataset, in 
which the performance of the NDI and FPGrowth algorithms stays more or less 
comparable. The explanation for this is in Tab. 1. In the BMS-POS dataset there 
are almost no derivable itemsets of low cardinality. 





[Figure: (a) BMS-POS, (b) Connect-4, (c) Pumsb, (d) Mushroom; horizontal axis: support]

Fig. 5. Comparison of running times of NDI and FPGrowth
[Figure: (a) Connect-4, (b) Pumsb]

Fig. 6. Comparison of the number of frequent sets and NDIs



6.3 Comparison with Other Concise Representations 

We compared the NDI-representation with the following other concise
representations:

- Free sets representation FreeRep [5],

- Disjunction-free sets representation DFreeRep [6],

- Disjunction-free generators representation DFreeGenRep [14], and

- Closed sets representation Closed [18].

In Figure 7, the sizes of these representations and |NDI| are reported on
different datasets. The experiments show that on these datasets, the NDI-representation
is often the smallest representation. Only in the BMS-Webview-1 dataset,
the NDI-representation is slightly larger than the Closed sets representation.



7 Related Work 

7.1 Concise Representations 

Closed itemsets [18] received a lot of attention in the literature [4,19,20]. They
can be introduced as follows: the closure of an itemset $I$ is the largest superset
of $I$ such that its support equals the support of $I$. This superset is unique and
is denoted by $cl(I)$. An itemset is called closed if it equals its closure. In [18],
the authors show that the set of frequent closed sets is a concise representation
for the frequent itemsets.

Free sets [5] or generators [14] (free sets [5] and generators [18,14] are the
same). An itemset $I$ is called free if it has no strict subset with the same support. The
free-set representation is based on the fact that if $\mathit{support}(A) = \mathit{support}(AB)$,
then also $\mathit{support}(AC) = \mathit{support}(ABC)$. This deduction can also be made with the
following two deduction rules presented in this paper:

$\mathit{support}(ABC) \le \mathit{support}(AC)$, and

$\mathit{support}(ABC) \ge \mathit{support}(AB) + \mathit{support}(AC) - \mathit{support}(A)$.
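Spelled out, the deduction takes one substitution. Assuming $\mathit{support}(A) = \mathit{support}(AB)$, the lower-bound rule gives

$$\mathit{support}(ABC) \ge \mathit{support}(AB) + \mathit{support}(AC) - \mathit{support}(A) = \mathit{support}(AC),$$

and together with the upper-bound rule $\mathit{support}(ABC) \le \mathit{support}(AC)$ this yields $\mathit{support}(ABC) = \mathit{support}(AC)$.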
[Figure: panels include (c) Pumsb and (d) Mushroom]

Fig. 7. Size of different concise representations

Disjunction-free sets [6] or disjunction-free generators [14] are an extension of
free sets. A set $I$ is called disjunction-free if there do not exist two items $A, B$
in $I$ such that

$$\mathit{support}(I) = \mathit{support}(I - A) + \mathit{support}(I - B) - \mathit{support}(I - AB) .$$

Free sets are a special case of disjunction-free sets, namely when $A = B$. The
representation is based on the fact that when

$$\mathit{support}(ABC) = \mathit{support}(AB) + \mathit{support}(AC) - \mathit{support}(A) ,$$

it is also true that

$$\mathit{support}(ABCD) = \mathit{support}(ABD) + \mathit{support}(ACD) - \mathit{support}(AD) .$$

Again, this deduction follows from the rules presented in this paper:

$s_{ABCD} \ge s_{ABD} + s_{ACD} - s_{AD}$, and

$s_{ABCD} \le s_{ABC} + s_{ABD} + s_{ACD} - s_{AB} - s_{AC} - s_{AD} + s_A$.
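The step from these two rules to the equality is a substitution. Assuming $s_{ABC} = s_{AB} + s_{AC} - s_A$, the upper-bound rule becomes

$$\begin{aligned}
s_{ABCD} &\le (s_{AB} + s_{AC} - s_A) + s_{ABD} + s_{ACD} - s_{AB} - s_{AC} - s_{AD} + s_A \\
&= s_{ABD} + s_{ACD} - s_{AD} ,
\end{aligned}$$

which coincides with the lower bound, so $s_{ABCD} = s_{ABD} + s_{ACD} - s_{AD}$.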



7.2 Deduction 

Another application of deduction rules is developed in [11]. Based on the obser- 
vation that highly frequent items tend to blow up the output of a data mining 
query by an exponential factor, the authors develop a technique to leave out 




Deducing Bounds on the Support of Itemsets 231 



these highly frequent items, and to reintroduce them after the mining phase by
using a deduction rule, the multiplicative rule. The multiplicative rule can be
stated as follows: let $I, J$ be itemsets, then

$$\mathit{support}(I \cup J, \mathcal{D}) \ge \mathit{support}(I, \mathcal{D}) + \mathit{support}(J, \mathcal{D}) - \mathit{support}(\{\}, \mathcal{D}) .$$

This rule can be derived from the rules in our framework.

Also in the field of artificial intelligence, much work has been done around 
inferring knowledge. Interesting related work in artificial intelligence concen- 
trates on logics for reasoning about probabilities, such as the probabilistic logic 
of Nilsson [17] and of Fagin et al. [10]. 

8 Conclusions and Further Work 

We presented sound and complete rules for deducing bounds on the support of
an itemset. These rules have many possible applications, such as improving the
pruning in the Apriori-algorithm, making concise representations, and deducing
the result of a data mining query based on previous query results. We evaluated
the rules against different real-life data sets. The experiments showed the
usefulness of the deduction rules for mining concise representations of the frequent
itemsets.

For the deduction rules presented in this paper, we need to know the supports 
of all subsets exactly. Interesting further work includes finding deduction rules 
for situations in which some of the subsets are missing, and when we only have 
partial knowledge of the supports. 
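Appendix A: Proofs

Proof of Lemma 4

Proof. (1)⇒(2): Proof by induction on $|I - J|$ that $x_J = \sum_{J \subseteq J' \subseteq I} (-1)^{|J' - J|} s_{J'}$.

Base case ($J = I$): for $J = I$, (2) reduces to $x_I = s_I$. $x_I = s_I$ is also in (1).

General case ($J$ arbitrary): By induction hypothesis, for all $I'$ such that $J \subset I' \subseteq I$,
$x_{I'} = \sum_{I' \subseteq J' \subseteq I} (-1)^{|J' - I'|} s_{J'}$. From (1) we know that $\sum_{J \subseteq I' \subseteq I} x_{I'} = s_J$.
Hence,

$$\begin{aligned}
x_J &= s_J - \sum_{J \subset I' \subseteq I} x_{I'} = s_J - \sum_{J \subset I' \subseteq I} \; \sum_{I' \subseteq J' \subseteq I} (-1)^{|J' - I'|} s_{J'} \\
&= s_J - \sum_{J \subset J' \subseteq I} \Big( \sum_{J \subset I' \subseteq J'} (-1)^{|J' - I'|} \Big) s_{J'} \\
&= s_J - \sum_{J \subset J' \subseteq I} \Big( \sum_{i=1}^{|J' - J|} \binom{|J' - J|}{i} (-1)^{|J' - J| - i} \Big) s_{J'} \\
&= s_J - \sum_{J \subset J' \subseteq I} \big( -(-1)^{|J' - J|} \big) s_{J'} \\
&= s_J + \sum_{J \subset J' \subseteq I} (-1)^{|J' - J|} s_{J'} = \sum_{J \subseteq J' \subseteq I} (-1)^{|J' - J|} s_{J'}
\end{aligned}$$

(2)⇒(1): Proof by induction on $|I - J|$ that $s_J = \sum_{J \subseteq I' \subseteq I} x_{I'}$.

Base case ($J = I$): for $J = I$, (1) reduces to $s_I = x_I$. $x_I = s_I$ is also in (2).

General case ($J$ arbitrary): By induction hypothesis, for all $J'$ such that
$J \subset J' \subseteq I$, $s_{J'} = \sum_{J' \subseteq I' \subseteq I} x_{I'}$. Since $x_J = \sum_{J \subseteq J' \subseteq I} (-1)^{|J' - J|} s_{J'}$, also
$s_J = x_J - \sum_{J \subset J' \subseteq I} (-1)^{|J' - J|} s_{J'}$. Hence,

$$\begin{aligned}
s_J &= x_J - \sum_{J \subset J' \subseteq I} (-1)^{|J' - J|} \sum_{J' \subseteq I' \subseteq I} x_{I'} = x_J - \sum_{J \subset I' \subseteq I} \Big( \sum_{J \subset J' \subseteq I'} (-1)^{|J' - J|} \Big) x_{I'} \\
&= x_J - \sum_{J \subset I' \subseteq I} (-1)\, x_{I'} = \sum_{J \subseteq I' \subseteq I} x_{I'}
\end{aligned}$$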

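Proof of Theorem 1

Proof. By definition, the integers $l, u$ such that $[l, u]$ is the tight interval implied
for $\mathit{support}(I)$ by $\{\mathit{support}(J) = s_J \mid J \subset I\}$, are the minimal and maximal
integer $s_I$ such that the system

$$S = \{\mathit{support}(J) = s_J \mid J \subset I\} \cup \{\mathit{support}(I) = s_I\}$$

is satisfiable. Using Corollary 1 and Lemma 4, we obtain that $S$ has a solution
if and only if

$$\begin{cases}
x_J \ge 0 & \forall J \subseteq I \\
x_J = \sum_{J \subseteq J' \subseteq I} (-1)^{|J' - J|} s_{J'} & \forall J \subseteq I
\end{cases}$$

has a solution. This system has a solution if and only if

$$\begin{aligned}
s_I &\le \sum_{J \subseteq J' \subset I} (-1)^{|I - J'| + 1} s_{J'} \qquad \forall J \subseteq I,\ |I - J| \text{ odd} \\
s_I &\ge \sum_{J \subseteq J' \subset I} (-1)^{|I - J'| + 1} s_{J'} \qquad \forall J \subseteq I,\ |I - J| \text{ even}
\end{aligned}$$

Hence, the maximal solution is the minimum of the upper bounds as given by
$\mathcal{R}_I(J)$, $|I - J|$ odd, and the minimal solution is the maximum of the lower bounds as
given by $\mathcal{R}_I(J)$, $|I - J|$ even. □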
Abstract. Data mining algorithms such as the Apriori method for finding
frequent sets in sparse binary data can be used for efficient computation
of a large number of summaries from huge data sets. The collection
of frequent sets gives a collection of marginal frequencies about the
underlying data set. Sometimes, we would like to use a collection of such
marginal frequencies instead of the entire data set (e.g. when the original
data is inaccessible for confidentiality reasons) to compute other interesting
summaries. Using combinatorial arguments, we may obtain tight
upper and lower bounds on the values of inferred summaries. In this paper,
we consider a class of summaries wider than frequent sets, namely
that of frequencies of arbitrary Boolean formulae. Given frequencies of
a number of any different Boolean formulae, we consider the problem of
finding tight bounds on the frequency of another arbitrary formula. We
give a general formulation of the problem of bounding formula frequencies
given some background information, and show how the bounds can
be obtained by solving a linear programming problem. We illustrate the
accuracy of the bounds by giving empirical results on real data sets.



1 Introduction 

Database management systems allow querying extensional data and intentional 
data. From the user’s point of view, extensional data are the data explicitly input 
by the user into the system. Intentional data are not put in by the user — that 
information is derived from extensional data. 

In inductive database management systems the intentional data are usually 
quite complex from various points of view and require a high computational ef- 
fort to obtain. In case of typical patterns (frequent sets, association rules), the 
common problem is that the domain of patterns is prohibitively large and the 
inductive database management system cannot compute them all. The typical 
approach is to let the user guide the system to the interesting patterns interac- 
tively, e.g., through queries, limiting the search space to be considered. 

Then, the question is if the result of one query could be reused at least partly 
for obtaining the next result as an alternative to re-computing the whole answer 
from extensional data. The following, more difficult question is whether there
might be some summaries that the inductive database management system can 
gather off-line before the data mining process, to improve the on-line behavior of 
most mining processes. Typically, one would be interested in sufficient statistics, 
i.e., summaries that can be substituted for the whole extensional data to avoid 
repetitive probing of the selection predicate for all candidate patterns. On the 
other hand, we prefer gathering summaries that are known to be efficient to 
obtain. 

In this paper we pinpoint how to take advantage in a particular context of a 
collection of summary queries that have been evaluated against the extensional 
data to bound the value of the evaluation functions of other queries. Providing 
bounds may be interesting when we have thresholds on the evaluation function, 
and a tight bound can enable us to make a correct decision about accepting 
or rejecting a pattern in the query answer. We focus on the context of reusing 
previous queries (without pre-selecting) and leave open the question of choosing 
beforehand which summaries should be used. 

Several data mining algorithms can be used for efficient computation of a 
large number of summaries from data. Such methods include Apriori-type al- 
gorithms for finding frequent sets [AMS+96] or episodes [MTV97] in binary or 
sequential data and methods for clustering large data sets [ZRL97]. The summary 
information given by such algorithms can then be used as an efficient condensed 
representation of the data set. When the available summaries are orders of mag- 
nitude smaller than the data set itself (typical in case of huge data sets in a data 
mining context), it could be worth using them instead of the entire data set to 
compute other interesting summaries. Typically, the information contained in a 
collection of summaries will not be sufficient to compute the precise value of all 
other summaries, but at least bounds could be inferred. If the accuracy of the es- 
timated result is not enough, the partial quantitative information (bounds) can 
be used to better optimize the query execution plan (of a query to the original 
data set). 

An interesting fundamental question is: how much information about the 
underlying data set does a collection of summaries give? In this paper we 
consider this question in the setting of frequent sets for binary data. Information 
about the frequencies of different itemsets can have strong implications for the 
frequencies of other itemsets. For example, if we know that$^1$ $f(AB) = f(A)$, then we 
know that $f(XA) = f(XAB)$ for any set $X$, a result that has been shown to 
be surprisingly useful in the context of so-called closures [PBTL99] and free 
sets [BBR00,BR01]. Also, we know that the frequency $f(X)$ of an itemset $X$ is 
bounded from above by the minimum support $f(Y)$ of a subset $Y$ of $X$. 

More generally, if we possess the information about the frequencies of some 
Boolean formulae (frequent itemsets being a particular case), the frequency of 
any other Boolean formula can be inferred, to some extent. The main question we 
pose in this paper is how we could efficiently construct upper and lower bounds 
for the frequencies of any Boolean formula, given the existing information. We 
show that this formula bounding problem can in fact be solved by transforming 
the question into a linear program and solving that problem. In the worst case 
the transformation leads to a program of exponential size, but we also give 
empirical results showing that the transformation in many cases is an efficient 
one. 

$^1$ Here we denote by $f(AB)$ the frequency of the itemset $\{A, B\}$. 

The paper is organized as follows. In Section 2 we define the basic notions and 
the support bounding task. Section 3 gives the solution, and Section 4 describes 
the empirical results. Finally, in Section 5, we summarize the paper and discuss 
open problems. 



2 Problem: Support Queries in Databases 

The problem we want to solve involves Boolean queries on binary relational 
databases. In order to present the problem in an exact way, we first make some 
formal definitions. 

Definition 1. A relation r over the finite attribute set X is a finite multiset of 
tuples, subsets of X. The degree of r is the cardinality of X, and the size of r 
is the (multiset) cardinality |r| of r. The set X is called the schema of r. 

In contrast to ordinary relational databases, we deal with binary data only. 
This allows a convenient notational shortcut: for example, instead of the 
tuple $(0, 1, 1, 0, 1)$ over the attributes $A, B, C, D, E$, we can talk about the 
tuple $\{B, C, E\}$. Also, a Boolean query such as “($A = 1$ and $B = 1$) or ($C = 1$ 
and $D = 0$)” can be written as “($A$ and $B$) or ($C$ and not $D$)”, or, in the 
algebraic notation, $AB + C\overline{D}$. The syntax and semantics of such queries are defined 
next. 

Definition 2. A Boolean formula over the attribute set $X$ is one of: 

1. $\top$ (the true constant), 

2. $A$ for some attribute $A \in X$ (an atom), 

3. $(\neg\phi)$ for some Boolean formula $\phi$ over $X$ (a negation), 

4. $(\phi\psi)$ for some Boolean formulae $\phi$, $\psi$ over $X$ (a conjunction), 

5. $(\phi + \psi)$ for some Boolean formulae $\phi$, $\psi$ over $X$ (a disjunction). 

We omit parentheses when there is no danger of ambiguity. Furthermore, 
the negation operator $\neg$ always binds to the shortest following subformula, and 
conjunction binds tighter than disjunction. In the case of negated atoms, we 
also write $\overline{A}$ for $\neg A$. Thus, $\overline{A}BC + A(B + \overline{C})$ means 
$((((\neg A)B)C) + (A(B + (\neg C))))$. These conventions leave it ambiguous in which direction conjunction 
and disjunction associate, but in fact all readings of an ambiguous formula have 
equivalent semantics by the following definition. 

Definition 3. Given a tuple $t \in r$, we define the truth value of all Boolean 
formulae over the schema of $r$ as follows. 

1. $[\top]_t = 1$, 

2. $[A]_t = 1$ if $A \in t$, $[A]_t = 0$ if $A \notin t$, 

3. $[(\neg\phi)]_t = 1 - [\phi]_t$, 

4. $[(\phi\psi)]_t = [\phi]_t \, [\psi]_t$, 

5. $[(\phi + \psi)]_t = \max\{[\phi]_t, [\psi]_t\}$. 

Two formulae $\phi$ and $\psi$ are equivalent if they always have the same truth 
value on the same tuple. That all readings of an ambiguous formula such as $\phi\psi\theta$ 
are equivalent is a standard result in propositional logic. What is different in our 
problem is that we extend the semantics to whole relations. 

Definition 4. Let $r$ be a relation and $\phi$ a Boolean formula over a common 
schema. Then the support of $\phi$ in $r$ is the proportion of tuples in $r$ for which $\phi$ 
is true, 

$$[\phi]_r = \frac{1}{|r|} \sum_{t \in r} [\phi]_t.$$

We write $[\phi]$ when the relation is clear from the context. 

The data mining literature contains a wealth of material on itemsets, sets of 
attributes. After Boolean formulae have been defined, it is easy to give semantics 
to itemsets as simple conjunctions. 

Definition 5. Let $X$ be a relation schema. A subset of $X$ is an itemset, and it 
is identified with the conjunction of all its elements. 

The name “itemset” originated in association rule mining, whose traditional 
application is market-basket data: the attributes are items offered for sale at 
a supermarket, and the tuples are customer transactions. It turns out that to 
find association rules that are in a certain sense interesting, it suffices to 
compute all itemsets whose support exceeds a threshold. This is usually done by a 
breadth-first search algorithm called Apriori [AMS+96], but several variations 
have been proposed. For example, depth-first search can be performed using 
FP-trees [HPY00], and a sampling approach can avoid database scans [Toi96]. An 
active area of research is mining not all frequent itemsets but only an interesting 
subfamily; see e.g. [GZ01,CG02,PBTL99,BBR00,BR01]. 

As an example, Table 1 shows a small binary database. We have e.g. $[A]_t = 
[A]_u = [AB]_t = [AC]_u = 1$ and $[A]_v = [B]_v = [AB]_v = [AC]_v = 0$, and over the 
whole database $[A]_r = 2/3$, $[B]_r = 1/3$, $[C]_r = 1$, $[AC]_r = 2/3$, and $[ABC]_r = 1/3$. 
If the frequency threshold is 1/2, the frequent itemsets are $A$, $C$, $AC$, and 
trivially the empty set, which corresponds to $\top$. 

Table 1. An example database r 

Tuple   A   B   C 
t       1   1   1 
u       1   0   1 
v       0   0   1 
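
The supports quoted for Table 1 can be recomputed mechanically. The following is a minimal sketch, assuming each tuple is encoded as a Python set of the attributes whose value is 1. The helper names `support` and `frequent_itemsets` are illustrative and not part of the paper, and the exhaustive enumeration is only sensible for a toy schema (it is not Apriori).

```python
from fractions import Fraction
from itertools import combinations

# Table 1: each tuple is the set of attributes whose value is 1.
r = [{"A", "B", "C"}, {"A", "C"}, {"C"}]  # tuples t, u, v

def support(itemset, relation):
    """Proportion of tuples that contain every attribute of `itemset`."""
    hits = sum(1 for t in relation if set(itemset) <= t)
    return Fraction(hits, len(relation))

def frequent_itemsets(attributes, relation, threshold):
    """All itemsets whose support exceeds `threshold` (exhaustive search)."""
    result = []
    for size in range(len(attributes) + 1):
        for combo in combinations(attributes, size):
            if support(combo, relation) > threshold:
                result.append("".join(combo))
    return result

print(support("AC", r))
# The empty string in the result stands for the empty itemset.
print(frequent_itemsets("ABC", r, Fraction(1, 2)))
```

Exact fractions are used so that the comparison with the threshold 1/2 is not affected by floating-point rounding.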




238 A. Bykowski et al. 



We now come to the formula bounding problem. Given are a set $\Phi$ of Boolean 
formulae over a schema $X$, and their supports in an unknown relation $r$. The 
desired result is the support of another formula $\psi$. This support is sometimes 
completely determined by the givens, but this is rare; in general we want the set 
of all possible supports. As it turns out, the minimum and maximum support 
determine this set completely, and we can allow minima and maxima also as 
inputs. We denote by $\mathrm{Int}_{\mathbb{Q}}(0, 1)$ the set of closed intervals $[a, b]$ of rational numbers 
where $0 \le a \le b \le 1$. 

Definition 6. The formula bounding task Bound($X, \Phi, f, \psi$) has the following 
inputs: a relation schema $X$, a set $\Phi$ of Boolean formulae over $X$, a function $f$ 
from $\Phi$ to $\mathrm{Int}_{\mathbb{Q}}(0, 1)$, and a Boolean formula $\psi$ over $X$. The solution of the task 
is the smallest set $I \subseteq [0, 1]$ such that $[\psi]_r \in I$ for all relations $r$ over $X$ fulfilling 
the constraint $[\phi]_r \in f(\phi)$ for all $\phi \in \Phi$. 

In other words, we want a sound and complete inference procedure for the 
support bounds of Boolean formulae. We call a procedure sound if its result $I$ 
rules out no possible solutions: for $q \notin I$, there should be no relation $r$ fulfilling 
the constraints defined by $f$ such that $[\psi]_r = q$. Conversely, the set $I$ returned 
by a complete procedure is such that every solution $q \in I$ is realizable in some 
relation fulfilling the constraints. (A trivially complete but non-sound procedure 
returns $I = \emptyset$ for all inputs; the similar sound but non-complete procedure 
always returns $I = [0, 1]$.) The problem is NP-hard, since it requires solving the 
satisfiability of $\psi$. 

The following lemma shows that it suffices to find upper and lower bounds 
for the numbers in I. Thus, the task has a closure property: the output is in the 
same form as each element of the input. 

Lemma 1. If the solution $I$ of Bound($X, \Phi, f, \psi$) is nonempty, then $I$ is an 
interval of rational numbers. 

Proof. We must prove that given any three rationals $P, W, Q \in [0, 1]$ with $P < 
W < Q$ and $P, Q \in I$, also $W \in I$. Since $W$ lies between $P$ and $Q$, there is 
a rational number $Z \in (0, 1)$ such that $W = ZP + (1 - Z)Q$. Since $P, Q \in I$, 
there exist relations $p$, $q$ fulfilling the constraints of the bounding problem such 
that $[\psi]_p = P$ and $[\psi]_q = Q$. We will construct a relation $w$ for which the support 
of all formulae $\theta$ over $X$ is $[\theta]_w = Z[\theta]_p + (1 - Z)[\theta]_q$. Since $[\theta]_w$ lies between the 
numbers $[\theta]_p$ and $[\theta]_q$, every inequality constraint $[\phi]_w \in f(\phi)$ will be satisfied. 
Further, $[\psi]_w = W$, as required. 

To construct the relation $w$, we would like to take $Z/|p|$ copies of all tuples 
in $p$ and $(1 - Z)/|q|$ copies of all tuples in $q$. This is impossible in the general 
case, but if we multiply the numbers $Z/|p|$ and $(1 - Z)/|q|$ by the least common 
multiple of their denominators, we can replace the numbers by integer multiples. 
It is then easy to check that $[\theta]_w = Z[\theta]_p + (1 - Z)[\theta]_q$ for all formulae $\theta$. 
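
The construction in the proof can be sketched in a few lines. This is an illustration under assumptions: relations are Python lists of sets, formulae are predicates on a tuple, the weight Z is passed as a `Fraction`, and the names `mix` and `support` are ours rather than the paper's. `math.lcm` requires Python 3.9 or later.

```python
from fractions import Fraction
from math import lcm

def support(formula, relation):
    """Proportion of tuples of `relation` on which the predicate `formula` holds."""
    return Fraction(sum(1 for t in relation if formula(t)), len(relation))

def mix(p, q, z):
    """Relation w with [theta]_w = z*[theta]_p + (1 - z)*[theta]_q for every theta."""
    wp = z / len(p)        # ideal (fractional) number of copies of each tuple of p
    wq = (1 - z) / len(q)  # ideal (fractional) number of copies of each tuple of q
    m = lcm(wp.denominator, wq.denominator)
    # After scaling by m both multiplicities are integers, so w is a real multiset.
    return p * int(wp * m) + q * int(wq * m)

p = [{"A"}, set()]   # [A]_p = 1/2
q = [{"A"}]          # [A]_q = 1
w = mix(p, q, Fraction(1, 3))
has_a = lambda t: "A" in t
print(len(w), support(has_a, w))
```

With Z = 1/3 the expected support is (1/3)(1/2) + (2/3)(1) = 5/6, which the six-tuple relation `w` realizes exactly.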

3 Solving the Bounding Task by Linear Programming 

In this section, we describe a solution to the Bound task of Definition 6. The 
solution is based on linear programming, and it is both sound and complete. 

3.1 Change of Variables 

Several kinds of equalities and inequalities hold in all relations. For example, 
$[A + B] = [A] + [B] - [AB]$ by the combinatorial inclusion-exclusion principle, 
and $0 \le [AB] \le [A] \le 1$ by the anti-monotonicity of support. A procedure for 
the Bound task has to incorporate all results of this type. 

Let us analyze how these results could be proved. The middle inequality 
follows from the observation that $[A] = [AB] + [A\overline{B}]$ and that the support $[A\overline{B}]$ 
lies in the interval $[0, 1]$. A similar idea gives a proof of the inclusion-exclusion 
formula: 

$$\begin{aligned}
[A + B] &= [AB] + [A\overline{B}] + [\overline{A}B] \\
        &= ([AB] + [A\overline{B}]) + ([AB] + [\overline{A}B]) - [AB] = [A] + [B] - [AB].
\end{aligned}$$

This suggests that a change of variables can make the needed results simpler to 
prove. To that end, we make the following definitions. 

Definition 7. Given an attribute $A \in X$, the positive literal based on $A$ is the 
Boolean formula $A$, and the negative literal based on $A$ is the Boolean formula $\overline{A}$. 
A literal based on $A$ is either the positive literal or the negative literal based on $A$. 

Definition 8. A clause over the attribute set X is a conjunction of zero or more 
literals based on different attributes. 

Our definition of a clause allows an attribute to appear at most once, in 
either a negative or a positive literal. For example, $BCE$ is a clause over the set 
$\{A, B, C, D, E\}$, but $BBE$ and $B\overline{B}E$ are not. The true constant $\top$ is a clause 
as the degenerate case of zero literals. 

Definition 9. A full clause over the attribute set X is a clause with exactly |X| 
literals. 

In a full clause each attribute appears exactly once, either as a negative or a 
positive literal. For example, the conjunction $ABCDE$ is a full clause over the 
set $\{A, B, C, D, E\}$, whereas $BCE$ is not. In the language of propositional logic, 
a full clause fully describes a model over the given attribute set. 

Full clauses are important for two reasons. First, there is a natural 
correspondence between relations and assignments of supports to full clauses. Given 
a relation $r$, any full clause $\theta$ over the schema of $r$ is satisfied by some 
nonnegative integral number of identical tuples in $r$. Conversely, given an assignment of 
nonnegative rational supports for all full clauses summing up to 1, it is simple 
to construct a relation giving rise to these supports. 

The second reason is that all formulae can be decomposed into full clauses 
(for formulae corresponding to typical queries it is easy). We record this in the 
following two propositions. 

Proposition 1. Every Boolean formula over an attribute set $X$ can be 
equivalently written as a disjunction $C_1 + C_2 + \cdots + C_p$ of distinct full clauses $C_i$. We 
call this the full disjunctive normal form. 

Proposition 2. The support of any Boolean formula $\phi$ over an attribute set $X$ 
can be written as a sum of supports of distinct full clauses. That is, there is a 
set of full clauses $C_1, C_2, \ldots, C_p$ such that $[\phi]_r = [C_1]_r + [C_2]_r + \cdots + [C_p]_r$ for 
any relation $r$ over $X$. 

These results enable us to untangle the complex interrelations of formulae. 
The supports of distinct full clauses are independent of each other$^2$, so any 
distribution of nonnegative supports for full clauses corresponds to a possible 
relation. Where the support of a Boolean formula appears in a constraint equality 
or inequality, we can invoke Proposition 2 to replace it by a sum of the supports 
of the corresponding full clauses. This amounts to a linear change of variables. 
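
The change of variables can be illustrated with a short sketch. It assumes a formula is given as a Python predicate over a truth assignment (a dict from attribute name to bool), and it enumerates full clauses in the order positive-before-negative, so that for two attributes the order is AB, A(not B), (not A)B, (not A)(not B). The names `full_clauses`, `indicator` and `support_from_clauses` are hypothetical helpers, not the paper's.

```python
from fractions import Fraction
from itertools import product

def full_clauses(attrs):
    """All 2^n full clauses; True marks a positive literal, False a negative one."""
    return [dict(zip(attrs, signs))
            for signs in product([True, False], repeat=len(attrs))]

def indicator(formula, attrs):
    """0/1 vector k: k[i] = 1 iff the i-th full clause appears in the full DNF."""
    return [int(bool(formula(clause))) for clause in full_clauses(attrs)]

def support_from_clauses(k, x):
    """Support of a formula as the product k.x, x being the full-clause supports."""
    return sum(ki * xi for ki, xi in zip(k, x))

attrs = ["A", "B"]
k_a_or_b = indicator(lambda c: c["A"] or c["B"], attrs)
x = [Fraction(3, 10), Fraction(3, 10), Fraction(4, 10), Fraction(0)]
print(k_a_or_b, support_from_clauses(k_a_or_b, x))
```

Enumerating all 2^n clauses is exactly the exponential cost discussed in the text; the sketch makes no attempt to avoid it.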

As an example, we consider an instance of Bound($X, \Phi, f, \psi$) with $X = 
\{A, B\}$, $\Phi = \{\top, A, B, AB\}$, and $\psi = AB$. After the change of variables, we 
have the system depicted in Table 2, which we should solve for $[AB]$. We have 
the additional information that $0 \le [\theta_i] \le 1$ for all formulae $\theta_i$, but we need not 
worry about the inclusion-exclusion principle or similar rules. We continue this 
example at the end of Section 3.2. 

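Table 2. Example bounding task with decomposition into full clauses 

$$\begin{array}{l|cccc}
 & AB & A\overline{B} & \overline{A}B & \overline{A}\,\overline{B} \\
\hline
[\top] = 1.0 & \times & \times & \times & \times \\
[A] = 0.6 & \times & \times & & \\
[B] = 0.7 & \times & & \times & \\
[AB] \in [0, 0.5] & \times & & & 
\end{array}$$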



3.2 Linear Programming 

We now turn to the classic optimization problem called linear programming. 
We only describe it briefly; see, e.g., Chapter 21 in [Kre93] for a good 
introduction to the subject, or the Linear Programming FAQ$^3$ for a 
comprehensive list of references. For the computational complexity of linear programming, 
see e.g. [MSW96]; briefly, common algorithms such as Simplex tend to be 
useful in practice although they have worst-case exponential complexity, but more 
sophisticated algorithms such as Karmarkar’s algorithm [Kar84] achieve lower 
complexity. 



$^2$ With the restriction that the supports of all full clauses sum up to 1; but this gives 
only a scaling factor. 

$^3$ http://www-unix.mcs.anl.gov/otc/Guide/faq/linear-programming-faq.html 

Definition 10. The linear programming problem LP($A, B, C$) comprises an $m \times 
n$ matrix $A$, an $m$-element column vector ($m \times 1$ matrix) $B$, and an $n$-element 
row vector ($1 \times n$ matrix) $C$. The solution of the problem is the vector $x$ that 
minimizes the scalar value $Cx$ subject to the restrictions $Ax \le B$, $x \ge 0$. We 
also denote by LP$'$($A, B, C$) the problem that is otherwise similar but where the 
first restriction is replaced by $Ax = B$. 

The matrix C is said to express the objective function, and A and B state 
the constraints of the problem. 

The problems LP($A, B, C$) and LP$'$($A, B, C$) are equivalent in expressive power 
and computational complexity. We use the first formulation in the fully general 
case of the formula bounding task Bound (Definition 6). For the kinds of inputs 
we get from Apriori and similar procedures, we actually have equalities for all 
input formulae, so we can use LP$'$($A, B, C$). Note that equalities $y = z$ can always 
be converted to the inequalities $y \le z$ and $y \ge z$. We map the problem Bound 
into an instance of a linear programming problem LP($A, B, C$) or LP$'$($A, B, C$) 
(depending on the kind of input). We talk about LP and inequalities in the 
following, but the case of LP$'$ and equalities is similar. 

Assume now that $I$ is the solution of an instance of Bound($X, \Phi, f, \psi$). By 
Lemma 1, we know that if the set $I$ is nonempty, it is a subinterval of $[0, 1]$ 
(in rationals). Therefore, we proceed to compute its infimum; the case of the 
supremum is symmetric. Denote $n = |X|$, and denote the $2^n$ full clauses over $X$ 
by $\theta_1, \theta_2, \ldots, \theta_{2^n}$. 

As input to Bound we have in effect a large set of inequalities that we will 
convert into one big matrix inequality $Ax \le B$. The vector $x$ will contain the 
unknowns: let $x = ([\theta_1]\ [\theta_2]\ \ldots\ [\theta_{2^n}])^T$. Then, Proposition 2 yields for every 
formula $\phi$ a binary vector $k = (k_1\ k_2\ \ldots\ k_{2^n})$ such that the support of $\phi$ 
can be written as a matrix product, $[\phi] = kx$. Using this fact, we encode the 
constraint $[\phi] \in f(\phi)$ by adding into $A$ two rows, $-k$ and $k$, and into $B$ two 
numbers, $-a$ and $b$, where $[a, b] = f(\phi)$. Then any $x$ satisfying $Ax \le B$ must 
satisfy $a \le kx \le b$. Finally, as a necessary consistency constraint corresponding 
to the fact $[\top] = 1$, we add the rows $(-1\ {-1}\ \ldots\ {-1})$ and $(1\ 1\ \ldots\ 1)$, and the 
numbers $-1$ and $1$. All in all, the dimensions of $A$ will be $2(|\Phi| + 1) \times 2^n$, the 
dimensions of $x$ will be $2^n \times 1$, and the dimensions of $B$ will be $2(|\Phi| + 1) \times 1$. 
Ways to reduce these dimensions will be discussed after Theorem 1. 
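
The encoding of interval constraints into the rows of the matrix inequality can be sketched as follows. This is a minimal illustration, assuming each input formula is already given by its 0/1 full-clause vector k together with its interval [a, b]; the function name `encode_constraints` is ours. The sketch always adds the consistency rows itself, so the constant formula is not listed among the inputs.

```python
from fractions import Fraction as F

def encode_constraints(constraints, n_clauses):
    """Turn interval constraints a <= k.x <= b into the rows of A x <= B.

    `constraints` is a list of (k, a, b), k being a 0/1 list of length n_clauses.
    The consistency constraint (all supports sum to 1) is added first.
    """
    rows = [([1] * n_clauses, F(1), F(1))] + list(constraints)
    A, B = [], []
    for k, a, b in rows:
        A.append([-v for v in k])  # -k.x <= -a, i.e. k.x >= a
        B.append(-a)
        A.append(list(k))          #  k.x <= b
        B.append(b)
    return A, B

phi = [([1, 1, 0, 0], F(6, 10), F(6, 10)),  # [A] = 0.6
       ([1, 0, 1, 0], F(7, 10), F(7, 10)),  # [B] = 0.7
       ([1, 0, 0, 0], F(0), F(5, 10))]      # [AB] in [0, 0.5]
A, B = encode_constraints(phi, 4)
print(len(A), len(A[0]), B)
```

The order of the rows is immaterial for the linear program; only the pairing of each row of A with its entry of B matters.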

Having encoded all the constraints of the problem in $A$ and $B$, we now have 
to select $C$ so that the solutions to the LP problem correspond to the supports 
of $\psi$. We once again invoke Proposition 2 to turn $[\psi]$ into a sum of supports 
of full clauses. Thus $C$ will be a 0/1 vector with $Cx = [\psi]$, and minimizing $Cx$ 
subject to the constraints gives the required infimum. When the bounds for the 
supports of input formulae are rational numbers, linear programming yields a 
rational value for the infimum, since for example the Simplex algorithm [Kre93, 
§21.3] uses only sums, differences, products and ratios to solve LP. Thus, the 
infimum corresponds to an assignment of nonnegative rational values to the 
supports of the full clauses, summing to 1 and obeying all the constraints of the 
original problem. Multiplying all the supports by the least common multiple of 
their denominators gives integer counts, whence a relation can be constructed. 
Thus the infimum is actually a minimum. 

We have now proved the following theorem. 

Theorem 1. The formula bounding task Bound($X, \Phi, f, \psi$) can be reduced to 
the linear programming task LP($A, B, C$). The matrix $A$ will have $O(|\Phi|)$ rows 
and $2^n$ columns, and the vectors $B$ and $C$ will respectively have $O(|\Phi|)$ and $2^n$ 
elements, where $n = |X|$. 

The output from our reduction has size $O(2^n |\Phi|)$, i.e., exponential in the 
number of attributes, where for the sake of simplicity we assume that all the numbers 
are represented using a fixed number of bits. Thus, a linear programming 
algorithm that requires polynomial time in the size of its input will take time that is 
polynomial in $|\Phi|$ but exponential in $n$. It would, therefore, be useful to diminish 
the exponential dependency on the number $n$ of attributes. 

First, if $\Phi$ consists of frequent itemsets, we can restrict $X$ to only those 
attributes that appear in the query $\psi$. To see this, consider two full clauses $\theta$ 
and $\theta'$ whose only difference is that $\theta$ has $A$ and $\theta'$ has $\overline{A}$, where $A$ is an attribute 
that does not appear in $\psi$. The two coordinates in $C$ corresponding to $\theta$ and $\theta'$ 
will be equal, and thus only the sum $[\theta] + [\theta']$ will be relevant to the objective 
function $Cx$. If a frequent set $\phi \in \Phi$ has different coordinates at the positions 
corresponding to the two full clauses, it must include $A$; then there is a frequent 
set $\phi' \in \Phi$ that differs from $\phi$ only by excluding $A$. Thus in removing $\phi$ from $\Phi$ 
we lose no information relevant to $[\psi]$. Once all such frequent sets are gone, we 
can remove the attribute $A$ from $X$. 

Second, we discuss whether using the family of all $2^n$ full clauses is necessary. 
One of the reasons we used full clauses was that they can be used to answer any 
support queries of Boolean formulae. However, many other families of formulae 
have this property. For example, Proposition 1 of [MT96] implies that the family 
of all conjunctions of atoms can be used to determine the supports of all Boolean 
formulae. Let us define a representation $\Theta$ over $X$ as a family of formulae such 
that the counts of all Boolean formulae over $X$ can be determined from the counts 
of the formulae in $\Theta$. In this context, we use integer counts $\mathrm{count}_r(\theta) = \sum_{t \in r} [\theta]_t$ 
instead of supports $[\theta]_r = \mathrm{count}_r(\theta) / \mathrm{count}_r(\top)$. 

Any representation that works for all $r$ must have $2^n$ formulae. Indeed, given 
the counts corresponding to a representation, we can use Proposition 2 to form 
a linear system of equations from which the counts of full clauses can be solved. 
If there are fewer than $2^n$ equations, the system is underdetermined, and since 
all its coefficients are integers, it will have infinitely many integral solutions. It is 
therefore relatively easy to construct two relations with the same counts of all 
formulae of the supposed representation but different counts of some full clauses. 

However, this does not rule out smaller representations that work for specific 
relations. When storing the counts of the conjunctions-of-atoms representation, 
we can leave out some counts that can be derived from others. If, e.g., there are 
no tuples satisfying the conjunction AB, we can leave out the count of ABC, 
and if the counts of D and DE are equal, we need store only one of the counts 
of AD and ADE. Similar ideas have been studied in [ML98,BBR00,BR01]. 

In our problem, we use fractional supports, not counts, which removes one 
degree of freedom. Since the supports of full clauses must add up to 1, we can 
leave out one number from the full-clauses representation. 

Third, in the case of LP$'$, where $A$ is a 0/1 matrix, we can often reduce 
the problem. If some row $a_i$ of the matrix $A$ is componentwise less than or equal to another 
row $a_j$, we can replace $a_j$ by $a_j - a_i$, while doing the corresponding replacement 
in $B$. Sometimes this will result in a zero in $B$; we can then deduce that several 
unknowns are zero and remove them. Even if this doesn’t occur, the matrix 
becomes sparser, which helps some algorithms that solve linear programming 
problems. 
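
The row-subtraction idea of the third reduction can be sketched as below. This is one possible reading, assuming the equality form with a 0/1 matrix and exact fractions on the right-hand side; the function name `subtract_dominated_rows` and the policy of repeating until no row dominates another are our assumptions, not a procedure given in the paper.

```python
from fractions import Fraction as F
from itertools import permutations

def subtract_dominated_rows(A, B):
    """Replace a_j by a_j - a_i (and b_j by b_j - b_i) whenever a_i <= a_j.

    Intended for the equality form A x = B with a 0/1 matrix A. All-zero rows
    are never used as a_i, so every subtraction removes at least one 1 from A
    and the loop terminates.
    """
    A = [list(row) for row in A]
    B = list(B)
    changed = True
    while changed:
        changed = False
        for i, j in permutations(range(len(A)), 2):
            if any(A[i]) and all(ai <= aj for ai, aj in zip(A[i], A[j])):
                A[j] = [aj - ai for ai, aj in zip(A[i], A[j])]
                B[j] -= B[i]
                changed = True
    return A, B

# Rows: all supports sum to 1, [A] = 0.6, [B] = 0.7 (full-clause order AB, A~B, ~AB, ~A~B).
A = [[1, 1, 1, 1], [1, 1, 0, 0], [1, 0, 1, 0]]
B = [F(1), F(6, 10), F(7, 10)]
A2, B2 = subtract_dominated_rows(A, B)
print(A2, B2)
```

In this small case the first row becomes (0 0 1 1) with right-hand side 0.4, i.e. the statement that the two full clauses containing the negative literal of A have total support 0.4.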

We now continue the example bounding task of Table 2. We reduce the system 
depicted in the table to LP($A, B, C$) with $x = ([AB]\ [A\overline{B}]\ [\overline{A}B]\ [\overline{A}\,\overline{B}])^T$. For 
example, the second equation is translated from $[AB] + [A\overline{B}] = 0.6$ to $(1\ 1\ 0\ 0)x \le 0.6$ 
and $(-1\ {-1}\ 0\ 0)x \le -0.6$. These inequalities form the third and fourth lines of $A$ 
and $B$ (see below). In this case, the first equation already forms the consistency 
constraint so we need not add it now. 

We obtain the values of $x$, $A$ and $B$ listed in Table 3, and $C = (1\ 0\ 0\ 0)$ 
(resp. $C = (-1\ 0\ 0\ 0)$) for finding the lower (resp. the upper) bound of $[AB]$. 
Solving these two LP problems gives the minimum 0.3 (with $x = (0.3\ 0.3\ 0.4\ 0.0)^T$) 
and the maximum 0.5 (with $x = (0.5\ 0.1\ 0.2\ 0.2)^T$). We can obtain actual 
relations by multiplying the values of $x$ by 10. 
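
The two bounds of the example can be reproduced with a deliberately naive exact solver. The sketch below is not Simplex: it enumerates every basis of the equality form and keeps the best feasible vertex, which is only workable for tiny programs and assumes a bounded feasible region and a constraint matrix of full row rank. The inequality on [AB] is turned into an equality with a slack variable s, which is our modelling choice; the names `solve_square` and `lp_min` are hypothetical.

```python
from fractions import Fraction as F
from itertools import combinations

def solve_square(M, rhs):
    """Gauss-Jordan elimination over exact fractions; None if M is singular."""
    n = len(M)
    aug = [[F(v) for v in row] + [F(b)] for row, b in zip(M, rhs)]
    for col in range(n):
        pivot = next((i for i in range(col, n) if aug[i][col] != 0), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for i in range(n):
            if i != col and aug[i][col] != 0:
                factor = aug[i][col]
                aug[i] = [vi - factor * vc for vi, vc in zip(aug[i], aug[col])]
    return [row[n] for row in aug]

def lp_min(A, B, C):
    """Minimum of C.x subject to A x = B, x >= 0, found by trying every basis."""
    m, n = len(A), len(A[0])
    best = None
    for basis in combinations(range(n), m):
        square = [[A[i][j] for j in basis] for i in range(m)]
        sol = solve_square(square, B)
        if sol is None or any(v < 0 for v in sol):
            continue  # singular basis or infeasible vertex
        value = sum(F(C[j]) * v for j, v in zip(basis, sol))
        if best is None or value < best:
            best = value
    return best

# Unknowns: [AB], [A~B], [~AB], [~A~B], plus a slack s with [AB] + s = 0.5.
A = [[1, 1, 1, 1, 0],   # supports sum to 1
     [1, 1, 0, 0, 0],   # [A] = 0.6
     [1, 0, 1, 0, 0],   # [B] = 0.7
     [1, 0, 0, 0, 1]]   # [AB] <= 0.5
B = [F(1), F(6, 10), F(7, 10), F(5, 10)]
lower = lp_min(A, B, [1, 0, 0, 0, 0])
upper = -lp_min(A, B, [-1, 0, 0, 0, 0])
print(lower, upper)
```

For realistic sizes one would hand the same matrices to a proper LP solver; the enumeration here only serves to make the worked example checkable with the standard library.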



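Table 3. The example bounding task converted into a linear program 

$$x = \begin{pmatrix} [AB] \\ [A\overline{B}] \\ [\overline{A}B] \\ [\overline{A}\,\overline{B}] \end{pmatrix}, \quad
A = \begin{pmatrix}
 1 &  1 &  1 &  1 \\
-1 & -1 & -1 & -1 \\
 1 &  1 &  0 &  0 \\
-1 & -1 &  0 &  0 \\
 1 &  0 &  1 &  0 \\
-1 &  0 & -1 &  0 \\
 1 &  0 &  0 &  0 \\
-1 &  0 &  0 &  0
\end{pmatrix}, \quad
B = \begin{pmatrix} 1 \\ -1 \\ 0.6 \\ -0.6 \\ 0.7 \\ -0.7 \\ 0.5 \\ 0 \end{pmatrix}$$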



4 Experiments 

We investigated the properties of the bounding procedure on two data sets. The 
first is connect-4 containing some game-state descriptions, the second is anpe, a 
database about unemployed people, set up by the French unemployment agency. 
We describe the specific properties of the data sets along with our results in 
Sections 4.2 and 4.3. 

We used as input to the bounding procedure different collections of frequent 
itemsets along with their supports [AMS+96,MT96]. As explained previously, an 
itemset is interpreted as the Boolean conjunction of items that it contains. 
Different collections of frequent itemsets correspond to different support thresholds, 
denoted by minsupp. 

In the implementation of the experiments, we used a less voluminous, 
although totally equivalent, representation of frequent itemsets, first described 
in [BR01]. Since this representation is smaller than all frequent itemsets, the 
resulting $\Phi$ contains fewer queries. The equivalence of representations guarantees 
that the same information can be inferred from it as from all frequent itemsets 
and their supports. We verified the equivalence by repeating some of the 
experiments using the ordinary frequent itemsets, and got exactly the same results. 



4.1 The Framework of the Experiments 

We can compute the support of a Boolean formula over an itemset X exactly, if 
we know the supports of all subsets of X. The procedure for this computation 
in [MT96] is also applicable when we know the supports of frequent sets only, 
but then it will yield approximate bounds — it is sound but not complete. Thus, 
we test our new contribution using formulae over infrequent itemsets. 

The protocol of the experiments can be summarized as follows: we 
compare the average size of the intervals inferred by Bound for 100 formulae for 
which the combinatorial support-computing procedure of [MT96] is confronted 
with infrequent (thus missing) terms. The infrequent terms are due to the fact 
that the support threshold we use to mine frequent itemsets (which, together with 
their supports, serve in the experiments as the formulae with 
known supports) exceeds the support of some terms required by the procedure 
of [MT96]. 

The detailed protocol is the following. For each of the two data sets, we 
selected $k = 100$ random itemsets $X_1, \ldots, X_k$ that have 10 items each and whose 
supports do not exceed a predefined $\sigma_{\max}$ (10% for connect-4 and 0.1% for 
anpe). To avoid selecting only itemsets with very low support, which typically 
account for the overwhelming majority of all itemsets, we weighted the probability 
of selecting an itemset $X$ proportionally to its support $[X]$. Even then, most of 
the selected itemsets have low support compared to $\sigma_{\max}$ (on average, 2.26% for 
connect-4 and 0.010% for anpe). 

Based on these itemsets, we randomly drew $k$ Boolean formulae $\psi_1, \ldots, \psi_k$, one 
formula, $\psi_i$, over each $X_i$. To mimic formulae of interest in real life, for each $X_i$ 
we first selected a subset $Y_i \subseteq X_i$ of items, each item of $X_i$ with probability 0.7. 
Then we defined $\psi_i$ as a disjunction of random full clauses over $Y_i$. We included 
each full clause $\theta$ in $\psi_i$ with probability $0.5 - 0.04j$, where $j$ is the number 
of negative literals in $\theta$. Thus, we preferred clauses with more positive literals. 
For example, a clause with 10 negative literals had the probability of 0.1 to 
be included in $\psi_i$. Then we computed Bound($X_i, \Phi_i, f_i, \psi_i$) where $\Phi_i$ consists 
of the precomputed frequent sets among the subsets of $X_i$, and $f_i$ assigns to 
each frequent set its known support. We report two scores, each an average over 
the 100 computations. Denoting the resulting lower and upper bounds by $L_i$ 
and $U_i$ for each computation, the first score is the average of $U_i - L_i$, the second 
the average of $(U_i - L_i)/U_i$, both averages over $i \in \{1, \ldots, 100\}$. 
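
The drawing of random formulae and the two reported scores can be sketched as follows. This is an illustration of the protocol as described above, not the authors' implementation: the names `inclusion_probability`, `random_formula` and `scores` are hypothetical, a clause is represented as a dict from item to sign, and the relative score assumes every upper bound is strictly positive.

```python
import random
from itertools import product

def inclusion_probability(num_negative):
    """Probability 0.5 - 0.04*j of keeping a full clause with j negative literals."""
    return 0.5 - 0.04 * num_negative

def random_formula(itemset, rng):
    """Return (Y, clauses): the sampled item subset and the chosen full clauses over it."""
    y = [item for item in itemset if rng.random() < 0.7]
    clauses = []
    for signs in product([True, False], repeat=len(y)):
        j = signs.count(False)  # number of negative literals in this clause
        if rng.random() < inclusion_probability(j):
            clauses.append(dict(zip(y, signs)))
    return y, clauses

def scores(bounds):
    """Mean absolute width U - L and mean relative width (U - L)/U over all runs."""
    absolute = sum(u - l for l, u in bounds) / len(bounds)
    relative = sum((u - l) / u for l, u in bounds) / len(bounds)
    return absolute, relative

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
y, clauses = random_formula(list("ABCDEFGHIJ"), rng)
print(len(y), len(clauses))
print(scores([(0.1, 0.2), (0.0, 0.4)]))
```

The formula itself is the disjunction of the returned clauses; feeding it to a bounding procedure is outside the scope of this sketch.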



4.2 Experiments with connect-4 

The connect-4 data set is very dense. It contains a relatively small number of items 
(129) and rows (67 557). 

[Figure 1, two panels: connect-4 (top) and anpe (bottom). Horizontal axis: support threshold of itemsets in the input of Bound (%). Left scale: relative interval size; right scale: absolute interval size.] 

Fig. 1. Average interval size vs. input itemsets’ support threshold produced by 
Bound on the connect-4 and anpe data sets 



In Figure 1 (top) we report the average size of the interval returned by the 
bounding procedure for different values of minsupp. The right-hand scale 
(diamonds) reports the difference between the ends of the interval, and the left-hand 
scale (squares) reports the ratio of this difference to the upper limit of the 
interval. As we can see, a lower minsupp results in a better bounding precision. This 
is due to the increasing number of input itemsets, and therefore a richer collection of 
information about the original data set. However, the cost associated with the 
computation and the use of these more voluminous collections of summaries also 
increases. 

The support computation of [MT96] potentially involves an exponential 
number of terms for a single Boolean formula. Typically the computation will involve 
a significant part of the lattice of itemsets, subsets of the itemset on which we 
base our random formula. Since our random formulae are based on infrequent 
itemsets, many terms (often a majority) have unknown supports. However, the 
interval size we observe is of the same order of magnitude as the support 
threshold $\sigma$, which bounds the error of each unknown support. It seems that the 
unknown supports tend to cancel out, which appears to be a promising result. 

Let us take an example. Consider the support threshold of $\sigma = 15\%$ and that 
the itemsets on which our random formulae are based have an average support 
of 2.26%, and never have a support above $\sigma_{\max} = 10\%$. Take a single itemset 
and the corresponding formula; typically, a great number of the itemset’s subsets 
are infrequent, each having support in the $[0, \sigma)$ range. When we compute the 
support of the formula as in [MT96], naturally most errors will cancel out, but 
one would not expect the overall error to be in the $[0, \sigma)$ range; our method 
yields an average uncertainty of less than 10%. 

4.3 Experiments with anpe 

The anpe data set is quite uncorrelated. With its 214 items and over 109 000 rows, 
it is significantly larger than connect-4. Frequent set mining extracts relatively 
small collections, unless we set a very small minsupp. We chose to extract itemsets 
at these low thresholds. In Figure 1 (bottom) we report the average interval sizes. 
As previously, we relate the scores to different values of minsupp. 

The results look fairly similar to those of the previous experiment. In com- 
paring the graphs it should be noted that the scaling of the axes is different: in 
this experiment, both the relative and the absolute errors are below 0.1 for all 
runs. In other words, this less dense data set allowed much greater precision in 
the resulting intervals. 

4.4 Observed Running Times 

In our experiments, we first gather summary query answers, to simulate either 
off-line or on-the-fly collecting of highly processed information. Then, we draw 
random formulae, as described in Section 4.1. For each random formula, we 
execute two steps: conversion into an LP$'$($A, B, C$) problem and solving it. 

During the experiments, we observed that frequent itemset mining is the 
most expensive phase, despite the optimization of using an efficient condensed 
representation of the itemset collection described in [BR01]. For example, for 
the connect-4 data set and minsupp = 5% it took more than 3000 seconds. 
Conversion to LP$'$($A, B, C$) took 4.78 seconds per formula on average, and 
solving LP$'$($A, B, C$) took only about 3.1 seconds per formula. Thus, the bounding 
procedure can be quite efficient in practice, after the frequent itemset mining 
has been performed. 
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5 Discussion and Future Work 

We have considered the problem of bounding the support of Boolean formulae 
when some aggregate information is available. We showed that the bounding 
problem can be reduced to a linear programming problem whose size can in 
the worst case be exponential in the number of attributes. While our result 
is foremost a theoretical one, we also gave empirical results showing that the 
bounding method can be effectively used to obtain additional information from 
frequent itemsets or other summaries. 

We emphasize that our aim is to find exact bounds. Another approach would 
be to approximate the frequency of the query and give some kind of tail bounds 
for the error of the approximation. The most natural way would be to take a 
sample from the database and compute all queries on the sample; thus, instead of 
frequent sets, the sample would serve as the representation of the original data. 
This kind of a method has been used for computing frequent sets (see [Toi96]). 
A more sophisticated approximation can be based on frequent sets (or simi- 
lar summaries) by building a probabilistic model over the variables occurring 
in the formula. A method using the maximum entropy principle is described 
in [PMS00]. Like our solution, it suffers from exponential complexity in the
number of variables occurring in the query. 
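
The sampling alternative mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the tail bound used is Hoeffding's inequality, which is our choice and not necessarily the bound used in [Toi96], and the names `hoeffding_sample_size` and `estimate_frequency` as well as the toy rows are hypothetical.

```python
import math
import random

def hoeffding_sample_size(eps, delta):
    """Smallest n such that 2*exp(-2*n*eps**2) <= delta (Hoeffding's inequality)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def estimate_frequency(rows, formula, n, rng):
    """Estimate the frequency of a Boolean formula from n rows drawn with replacement."""
    sample = [rng.choice(rows) for _ in range(n)]
    return sum(formula(row) for row in sample) / n

# Toy data (an assumption for illustration): each row is the set of attributes that hold.
rows = [{"A", "B"}, {"A"}, {"B"}, {"A", "B"}]
formula = lambda row: "A" in row or "B" not in row  # the formula A or (not B)

n = hoeffding_sample_size(eps=0.05, delta=0.05)
estimate = estimate_frequency(rows, formula, n, random.Random(0))
```

With this sample size, the estimate deviates from the true frequency by more than `eps` with probability at most `delta`; unlike the bounds studied in this paper, the resulting interval is probabilistic rather than exact.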

Calders and Goethals [Cal02,CG02] have studied a similar problem. They 
have derived deduction rules for bounding the support of an itemset given the 
exact supports of all its proper subsets. While the rules are sound and complete 
for that task, they don’t solve our more general problem. In particular, these 
rules are not applicable when the supports of some subsets are unknown. Thus 
they cannot derive directly the support of a derivable itemset, but must first 
bound recursively the supports of all its proper subsets. They deal only with 
itemsets, i.e., conjunctions of attributes, not arbitrary formulae. The full set of 
rules is exponentially large, although Calders and Goethals give experimental
evidence that a small subset of the rules suffices to give a reasonably good 
result. 

Several open problems remain. One area is obtaining a faster method for the 
inference problem. With large, redundant summaries such as frequent itemsets, 
the solution by linear programming is quite slow, and it is in many cases outper- 
formed by the simple “scan the database once and count” method. The method 
could, however, be useful in cases where the data set is not available or where the 
set of queries (corresponding to known supports) carries a lot of information 
condensed in well chosen summaries, orders of magnitude smaller than the data 
set itself. Thus, the following fundamental issue is interesting. 

Problem 1. Given a relation r, an amount Z of storage, and a class of queries P
that we wish to perform on r, what should we store in Z (which presumably
cannot hold all of r) in order to most effectively answer the queries in P?

Frequent sets are typically redundant collections, and thus are not optimal. 
In fact, in our experiments we used the smaller collection of disjunction-free 
sets [BR01], and further gains could be obtained using the Calders-Goethals
rules [CG02]. Another interesting representation is the AD-tree [ML98]. In
general, if we store in Z the answers to some Boolean queries
$\phi_1, \phi_2, \ldots, \phi_N$, the
linear programming approach shows the limits of what we can reconstruct. Per- 
haps a suitable set of formulae would allow an analytical solution, possibly only 
approximate, of the linear program. The problem of computing frequent sets 
from data has been extensively studied, and they were used in [MT96], which 
formed the starting point for our research. But the linear programming frame- 
work does not depend on them — it can be used with supports of any formulae. 

Another interesting issue is how to relax (if possible) the requirements of 
Definition 6 if the complete procedure is too slow. We do not want to give un- 
sound answers, but too wide intervals are not necessarily harmful. The simplest 
incomplete and sound algorithm “return the interval [0, 1]” is not useful, but 
we suspect there might be a reasonably fast compromise between it and the 
complete linear programming approach. 

Problem 2. How close to completeness can a polynomial-time (or linear-time, 
or randomized polynomial-time) sound solution to Bound come?

Abstract. In this paper, we propose a general framework for condensed
representations of sets of mining queries. To this end, we adapt the stan- 
dard notions of maximal, closed and key patterns introduced in previous 
works, including those dealing with condensed representations. Whereas 
these previous works concentrate on condensed representations of the 
answer to a single mining query, we consider the more general case of 
sets of mining queries defined by monotonic and anti-monotonic selection 
predicates. 



1 Introduction 

In the past decades, the problem of discovery of interesting patterns in large 
databases has motivated many research efforts. Whereas these works have fo- 
cussed mainly on the efficiency of the algorithms [1,6,12,16], some other issues 
have been recently considered, among which the problem of efficient storage of 
the result of an extraction [4,14,15]. In this paper, we propose a general frame- 
work for condensed representations of the answers to a set of mining queries. 
More precisely, we assume that we are given: 

1. A set $\boldsymbol{\Delta}$ of all data sets $\Delta$ from which the patterns
are to be discovered.

2. A partially ordered set of patterns $\mathbb{L}$, where the partial ordering
is denoted by $\preceq$.

3. A set of selection predicates $\mathbb{Q}$, a selection predicate being a
boolean function defined over $\mathbb{L} \times \boldsymbol{\Delta}$.

4. A set of measure functions $\mathbb{F}$, a measure function being a real
function defined over $\mathbb{L} \times \boldsymbol{\Delta}$.

Moreover, given a selection predicate q and a data set $\Delta$ in
$\boldsymbol{\Delta}$, we say that a pattern $\varphi$ in $\mathbb{L}$ is
interesting in $\Delta$ with respect to q if $q(\varphi, \Delta)$ has the value
true. Any selection predicate is also called a simple mining query and the set of
interesting patterns in $\Delta$ with respect to q, denoted by $sol(q/\Delta)$,
is called the answer of q in $\Delta$.

We call extended mining query any pair of the form (q, f) where q is in
$\mathbb{Q}$ and f is in $\mathbb{F}$. The answer in $\Delta$ to an extended
mining query (q, f), denoted by $ans(q, f/\Delta)$, is the set of pairs
$(\varphi, f(\varphi, \Delta))$ such that $\varphi$ is in $sol(q/\Delta)$.

In the following example, that will be used as a running example throughout 
the paper, we illustrate these notions in the classical association rule mining 
problem of [1]. 

Running Example 1. Given a set of items Items, the set of patterns $\mathbb{L}$
considered in our approach is the set of all subsets of Items, i.e.,
$\mathbb{L} = 2^{Items}$. Moreover, the partial ordering over $\mathbb{L}$ that
we consider is set inclusion: given two patterns $\varphi$ and $\varphi'$ in
$\mathbb{L}$, we say that $\varphi \preceq \varphi'$ if
$\varphi' \subseteq \varphi$.

In this context, a data set $\Delta$ is defined by a set of transactions Tr and a
function it from Tr to $\mathbb{L}$. Given a transaction $x \in Tr$, $it(x)$ is
the set of items in transaction x. The support of a pattern is an example of
measure function of $\mathbb{F}$. More precisely, for every pattern $\varphi$,
the support of $\varphi$ in $\Delta$, denoted by $sup(\varphi, \Delta)$, is
defined by:

    $sup(\varphi, \Delta) = |\{x \in Tr \mid it(x) \preceq \varphi\}| / |Tr|$.

Note that, given a minimal support threshold minsup, we can consider the
selection predicate q defined by: for every pattern $\varphi \in \mathbb{L}$,
$q(\varphi, \Delta)$ = true if $sup(\varphi, \Delta) \geq minsup$.

In the rest of the paper, we consider the case where the set of items is Items = 
{A,B,C,D,E} and where the set of transactions is Tr = {1, 2, . . . , 10}. For 
the sake of simplicity, sets of items are denoted by the concatenation of their 
elements, e.g. the set of items {A, B, C} is denoted by ABC. The function it
from Tr to $\mathbb{L}$ that defines the data set $\Delta$ is represented in the
table of Figure 1.



Tr    Set of Items
 1    A
 2    DE
 3    ABCE
 4    ABE
 5    ABCDE
 6    ACD
 7    ABCE
 8    AE
 9    ABCDE
10    CD

Fig. 1. Example of data set and sub-lattices of interesting patterns



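Let $m_1$, $m_2$, $a_1$ and $a_2$ be selection predicates defined for every
pattern $\varphi$ in $\mathbb{L}$ by:

- $m_1(\varphi, \Delta)$ = true if $B \subseteq \varphi$, and
  $m_2(\varphi, \Delta)$ = true if $A \subseteq \varphi$.

- $a_1(\varphi, \Delta) = \tilde{a}_1(\varphi, \Delta) \wedge \hat{a}_1(\varphi, \Delta)$,
  where $\tilde{a}_1(\varphi, \Delta)$ = true if $sup(\varphi, \Delta) \geq 0.4$,
  and $\hat{a}_1(\varphi, \Delta)$ = true if $\varphi \subseteq ABCE$.

- $a_2(\varphi, \Delta) = \tilde{a}_2(\varphi, \Delta) \wedge \hat{a}_2(\varphi, \Delta)$,
  where $\tilde{a}_2(\varphi, \Delta)$ = true if $sup(\varphi, \Delta) \geq 0.3$,
  and $\hat{a}_2(\varphi, \Delta)$ = true if $\varphi \subseteq ABCD$.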

$q_1 = m_1 \wedge a_1$ and $q_2 = m_2 \wedge a_2$ are simple mining queries.
Moreover, it is easy to see from the table in Figure 1 that:

    $sol(m_1 \wedge a_1/\Delta) = \{B, AB, BC, BE, ABC, ABE, BCE, ABCE\}$
    $sol(m_2 \wedge a_2/\Delta) = \{A, AB, AC, AD, ABC, ACD\}$

On the other hand, $(q_1, sup)$ and $(q_2, sup)$ are examples of extended mining
queries, and we have:

    $ans(q_1, sup/\Delta) = \{(B, 0.5), (AB, 0.5), (BC, 0.4), (BE, 0.5), (ABC, 0.4),
                              (ABE, 0.5), (BCE, 0.4), (ABCE, 0.4)\}$
    $ans(q_2, sup/\Delta) = \{(A, 0.8), (AB, 0.5), (AC, 0.5), (AD, 0.3), (ABC, 0.4),
                              (ACD, 0.3)\}$

The answers in $\Delta$ of $(q_1, sup)$ and $(q_2, sup)$ are also represented in
Figure 1. □
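
The answers of the running example can be checked by brute force. The sketch below is only an illustration, not an algorithm from the paper: it enumerates all subsets of Items over the table of Figure 1, and the function names (`sup`, `sol`, `ans`, `all_patterns`) and the string encoding of itemsets are our own assumptions.

```python
from itertools import combinations

# Transactions of Figure 1: transaction id -> set of items (as a string).
DATA = {
    1: "A", 2: "DE", 3: "ABCE", 4: "ABE", 5: "ABCDE",
    6: "ACD", 7: "ABCE", 8: "AE", 9: "ABCDE", 10: "CD",
}
ITEMS = "ABCDE"

def sup(pattern, data=DATA):
    """Fraction of transactions that contain every item of `pattern`."""
    p = set(pattern)
    return sum(p <= set(t) for t in data.values()) / len(data)

def all_patterns(items=ITEMS):
    """All subsets of `items`, smallest first, written as strings."""
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield "".join(combo)

def sol(must_contain, within, minsup):
    """Patterns p with must_contain <= p <= within (as sets) and sup(p) >= minsup."""
    return [p for p in all_patterns()
            if set(must_contain) <= set(p) <= set(within) and sup(p) >= minsup]

def ans(must_contain, within, minsup):
    """Pairs (pattern, support) for the patterns returned by sol."""
    return {p: sup(p) for p in sol(must_contain, within, minsup)}

sol_q1 = sol("B", "ABCE", 0.4)   # q1: B in pattern, pattern within ABCE, sup >= 0.4
sol_q2 = sol("A", "ABCD", 0.3)   # q2: A in pattern, pattern within ABCD, sup >= 0.3
```

Exhaustive enumeration is exponential in the number of items, so this is only practical for toy examples such as this one.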

In the case of a simple mining query q, we recall that $sol(q/\Delta)$ can be
computed without any access to $\Delta$ if only the maximal and minimal elements
of $sol(q/\Delta)$ (with respect to the partial ordering on $\mathbb{L}$) are
known [9,12]. Indeed, denoting these sets by $G(q/\Delta)$ and $S(q/\Delta)$,
respectively, we know that a pattern $\varphi$ is in $sol(q/\Delta)$ if and only
if there exist $\varphi_g \in G(q/\Delta)$ and $\varphi_s \in S(q/\Delta)$ such
that $\varphi_s \preceq \varphi \preceq \varphi_g$. Since
$G(q/\Delta) \cup S(q/\Delta) \subseteq sol(q/\Delta)$, we say that
$\{G(q/\Delta), S(q/\Delta)\}$ is a condensed representation of $sol(q/\Delta)$.

In our Running Example 1, it can be seen that $G(q_1/\Delta) = \{B\}$ and
$S(q_1/\Delta) = \{ABCE\}$. Thus, $sol(q_1/\Delta)$ is the set of all itemsets
$\varphi$ such that $B \subseteq \varphi \subseteq ABCE$, and this can be
computed independently from $\Delta$.

On the other hand, in the case of extended mining queries, we adapt the
notions of closed patterns and of key patterns ([3,4,16]) to our formalism, which
allows us to obtain condensed representations of the set $ans(q, f/\Delta)$ (see
Section 3.3). For instance, in our Running Example 1, for $q_1$ and the function
sup, it will be seen that the answer $ans(q_1, sup/\Delta)$ can be computed
without any access to $\Delta$, from the three sets $\{B\}$, $\{ABCE\}$, and
$\{(ABE, 0.5), (ABCE, 0.4)\}$. In this case, we say that these three sets
constitute an extended condensed representation of $ans(q_1, sup/\Delta)$.

As the main contribution of this paper, we consider the case of sets of mining 
queries (simple or extended) . Noting that the union of condensed representations 
of different mining queries is not a condensed representation of the corresponding 
set of mining queries ([9]), we extend the notions of maximal, minimal, closed 
and key patterns to the case of sets of mining queries. Then, we propose con- 
densed representations for such sets, in the sense that, given a set Q of mining 
queries, the answers in $\Delta$ of the queries in Q can be computed based only
on the condensed representation, i.e., without any access to the data set
$\Delta$.

In the case of our Running Example 1, consider the set of simple mining
queries $Q = \{q_1, q_2\}$. Then, it will be seen in Section 4 that the sets of
pairs $\{(ABCE, q_1), (ACD, q_2)\}$ and $\{(B, q_1), (A, q_2)\}$ constitute a
condensed representation of $sol(q_1/\Delta)$ and $sol(q_2/\Delta)$. We would
like to emphasize that in the first set above, the maximal element ABC in
$sol(q_2/\Delta)$ does not appear in the given condensed representation. Thus,
in condensed representations of sets of mining queries, some maximal or minimal
elements with respect to single mining queries can be omitted.

Comparing our approach to that of [6,7], we note that in [6,7] the authors 
consider conjunctive queries made of monotonic and anti-monotonic primitives, 
which correspond to what we call simple mining queries. Moreover, it is shown 
in [6,7] that the answer to one such query can be represented by its minimal and 
maximal elements only. However, contrary to the present paper, the case of sets 
of queries is not considered. 

On the other hand, in [4], the authors also consider conjunctive queries. 
They use a caching technique to store condensed representations of the answers 
to these queries together with their supports. In our terminology, this corre- 
sponds to extended mining queries. However, in [4], each answer is condensed 
separately and stored in the cache, whereas our approach allows us to benefit from
relationships between the queries in order to further condense the answers to the 
queries. 

Thus, our approach can be seen as an extension of [6,7] and [4]. In this paper, 
however, we do not consider computational aspects, such as the computation and 
the maintenance of condensed representations. 

The paper is organized as follows: In Section 2, we give the formal definitions 
of the basic concepts of our approach, and in Section 3, mining queries, condensed 
representations as well as maximal, closed and key patterns are introduced. 
Section 4 deals with condensed representations of sets of mining queries. In 
Section 5, we conclude the paper and we propose further research directions 
based on this work. 



2 Basic Definitions 

In our formalism, we assume that we are given: 

1. A set $\boldsymbol{\Delta}$ of all data sets from which the patterns are to
be discovered. For instance, $\boldsymbol{\Delta}$ can be thought of as being
the set of all instances of a given relation schema.

2. A set of patterns $\mathbb{L}$ and a partial ordering $\preceq$ over
$\mathbb{L}$. Given two patterns $\varphi_1$, $\varphi_2$ in $\mathbb{L}$, we
say that $\varphi_1$ is more specific than $\varphi_2$ (or that $\varphi_2$ is
more general than $\varphi_1$) if we have $\varphi_1 \preceq \varphi_2$.

3. A set of selection predicates $\mathbb{Q}$, a selection predicate
$q \in \mathbb{Q}$ being a boolean function defined over
$\mathbb{L} \times \boldsymbol{\Delta}$. Moreover, given a pattern $\varphi$ in
$\mathbb{L}$ and a data set $\Delta$ in $\boldsymbol{\Delta}$, we say that
$\varphi$ is interesting in $\Delta$ with respect to q if
$q(\varphi, \Delta)$ = true.

4. A set of measure functions $\mathbb{F}$, a measure function being a function
defined from $\mathbb{L} \times \boldsymbol{\Delta}$ to $\Re$.



Now, we define when a selection predicate is independent from $\Delta$.

Definition 1 - Data Independency. Let q be a selection predicate in
$\mathbb{Q}$. q is data independent (or independent for short) if there exists a
function $\hat{q}$ from $\mathbb{L}$ to {true, false} such that for every data
set $\Delta$ in $\boldsymbol{\Delta}$ and every pattern $\varphi$ in
$\mathbb{L}$, $q(\varphi, \Delta) = \hat{q}(\varphi)$.

In our Running Example 1, it is easy to see that the selection predicates
$m_1$, $m_2$, $\hat{a}_1$ and $\hat{a}_2$ (the inclusion components of $a_1$ and
$a_2$) are independent. In the following, we denote by $\hat{\mathbb{Q}}$ the
set of all independent selection predicates, and by $\tilde{\mathbb{Q}}$ the
complement of $\hat{\mathbb{Q}}$ in $\mathbb{Q}$, i.e.,

    $\tilde{\mathbb{Q}} = \mathbb{Q} \setminus \hat{\mathbb{Q}}$.

In this paper, we consider only selection predicates that are monotonic or 
anti-monotonic, and measure functions that are monotonic increasing. 

Definition 2 - Monotonicity. Let q be a selection predicate.

- q is monotonic if for every data set $\Delta$ in $\boldsymbol{\Delta}$ and
  every pair of patterns $(\varphi_1, \varphi_2)$ in $\mathbb{L}^2$, we have:

  if $\varphi_1 \preceq \varphi_2$ and $q(\varphi_2, \Delta)$ = true, then
  $q(\varphi_1, \Delta)$ = true.

- q is anti-monotonic if for every data set $\Delta$ in $\boldsymbol{\Delta}$
  and every pair of patterns $(\varphi_1, \varphi_2)$ in $\mathbb{L}^2$, we have:

  if $\varphi_1 \preceq \varphi_2$ and $q(\varphi_1, \Delta)$ = true, then
  $q(\varphi_2, \Delta)$ = true.

Let f be a measure function. f is a monotonic increasing function if for every
data set $\Delta$ in $\boldsymbol{\Delta}$ and every pair of patterns
$(\varphi_1, \varphi_2)$ in $\mathbb{L}^2$, we have:

  if $\varphi_1 \preceq \varphi_2$, then
  $f(\varphi_1, \Delta) \leq f(\varphi_2, \Delta)$.

In our Running Example 1, it is easy to see that the selection predicates $m_i$
(i = 1, 2) are monotonic, whereas the selection predicates $\tilde{a}_i$ and
$\hat{a}_i$ (i = 1, 2), i.e., the support and inclusion components of $a_i$, are
anti-monotonic. Moreover, the measure function sup is an example of monotonic
increasing measure function.
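
The claims about the running example can be checked exhaustively on the data set of Figure 1. The sketch below is an illustration under our own assumptions (the helper names are hypothetical); note that Definition 2 quantifies over every data set, whereas this check covers only the single data set of Figure 1.

```python
from itertools import combinations

DATA = ["A", "DE", "ABCE", "ABE", "ABCDE", "ACD", "ABCE", "AE", "ABCDE", "CD"]
PATTERNS = [frozenset(c) for k in range(6) for c in combinations("ABCDE", k)]

def sup(p):
    """Support of itemset p in the data set of Figure 1."""
    return sum(p <= set(t) for t in DATA) / len(DATA)

def more_specific(p1, p2):
    """p1 is more specific than p2: in the running example, p1 contains p2."""
    return p2 <= p1

def is_monotonic(q):
    # If p1 is more specific than p2 and q(p2) holds, then q(p1) must hold.
    return all(q(p1) for p1 in PATTERNS for p2 in PATTERNS
               if more_specific(p1, p2) and q(p2))

def is_anti_monotonic(q):
    # If p1 is more specific than p2 and q(p1) holds, then q(p2) must hold.
    return all(q(p2) for p1 in PATTERNS for p2 in PATTERNS
               if more_specific(p1, p2) and q(p1))

def is_monotonic_increasing(f):
    # If p1 is more specific than p2, then f(p1) <= f(p2).
    return all(f(p1) <= f(p2) for p1 in PATTERNS for p2 in PATTERNS
               if more_specific(p1, p2))

m1 = lambda p: frozenset("B") <= p                        # B is contained in p
a1 = lambda p: sup(p) >= 0.4 and p <= frozenset("ABCE")   # frequent and within ABCE
```

For instance, `is_monotonic(m1)`, `is_anti_monotonic(a1)` and `is_monotonic_increasing(sup)` all hold on this data set, while `m1` is not anti-monotonic and `a1` is not monotonic.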

In the following, we denote by $\mathbb{A}$ the set of all anti-monotonic
selection predicates and by $\mathbb{M}$ the set of all monotonic selection
predicates. Moreover, we denote by $\hat{\mathbb{A}}$ (respectively
$\hat{\mathbb{M}}$) the set of all selection predicates in $\mathbb{A}$
(respectively $\mathbb{M}$) that are independent, and by $\tilde{\mathbb{A}}$
(respectively $\tilde{\mathbb{M}}$) the set of all selection predicates in
$\mathbb{A}$ (respectively $\mathbb{M}$) that are not independent. Finally, we
denote by $\mathbb{I}$ the set of all monotonic increasing measure functions.

In our approach, selection predicates are compared according to the following 
definition. 

Definition 3 - Selectivity. Let $q_1$ and $q_2$ be two selection predicates.
$q_1$ is more selective than $q_2$, denoted by $q_1 \sqsubseteq q_2$, if for
every data set $\Delta$ in $\boldsymbol{\Delta}$ and every pattern $\varphi$ in
$\mathbb{L}$, we have: if $q_1(\varphi, \Delta)$ = true, then
$q_2(\varphi, \Delta)$ = true.

In the context of our Running Example 1, let $\sigma_1$ and $\sigma_2$ be two
support thresholds. For i = 1, 2, let $q_i$ be the selection predicate defined
by: for every pattern $\varphi$, $q_i(\varphi, \Delta)$ = true if
$sup(\varphi, \Delta) \geq \sigma_i$. It is easy to see that if
$\sigma_2 \geq \sigma_1$, then $q_2 \sqsubseteq q_1$.

In the rest of the paper, we consider a fixed data set $\Delta$ in
$\boldsymbol{\Delta}$. Therefore, for notational convenience, we shall omit
$\Delta$ in the subsequent definitions and propositions. For instance, referring
to the previous two definitions, $q(\varphi, \Delta)$ and $f(\varphi, \Delta)$
will be simply denoted by $q(\varphi)$ and $f(\varphi)$, respectively.



3 Mining Query and Condensed Representations 

3.1 Basic Definitions 

In our approach, we define two types of mining query. 

Definition 4 - Mining Query. A simple mining query is a selection predicate
q. Given a data set $\Delta$, the answer of q in $\Delta$, denoted by $sol(q)$,
is defined by:

    $sol(q) = \{\varphi \in \mathbb{L} \mid q(\varphi) = true\}$.

$sol(q)$ denotes the set of all interesting patterns in $\mathbb{L}$ with
respect to q.

An extended mining query is a pair (q, f) where q is a selection predicate and
f is a measure function. Given a data set $\Delta$, the answer of (q, f) in
$\Delta$, denoted by $ans(q, f)$, is defined by:

    $ans(q, f) = \{(\varphi, f(\varphi)) \mid \varphi \in sol(q)\}$.

Note that an algorithm proposed in [5] can compute directly $sol(q)$ and
$ans(q, f)$ if $q = m \wedge a$ with $m \in \mathbb{M}$ and $a \in \mathbb{A}$.

Let $Y = \{(y_1^i, y_2^i, \ldots, y_n^i) \mid i = 1, \ldots, p\}$ be a set of
tuples whose first elements are patterns in $\mathbb{L}$. The projection of Y on
$\mathbb{L}$, denoted by $\pi_{\mathbb{L}}(Y)$, is defined by:
$\pi_{\mathbb{L}}(Y) = \{y_1^1, y_1^2, \ldots, y_1^p\}$. We note that
$\pi_{\mathbb{L}}(Y) \subseteq \mathbb{L}$, and that for every
$q \in \mathbb{Q}$ and $f \in \mathbb{F}$,
$sol(q) = \pi_{\mathbb{L}}(ans(q, f))$.

We now introduce the notion of condensed representation. 

Definition 5 - Condensed Representation. Let $X_1, \ldots, X_K$ be sets of
patterns, i.e., $X_k \subseteq \mathbb{L}$ ($k = 1, \ldots, K$). Given a mining
query $q \in \mathbb{Q}$ and a data set $\Delta$, $\{X_1, \ldots, X_K\}$ is a
condensed representation of $sol(q)$, denoted by
$X_1, \ldots, X_K \models sol(q)$, if:

- $(X_1 \cup \ldots \cup X_K) \subseteq sol(q)$, and

- there exists a function F independent from $\Delta$ such that:
  $sol(q) = F(X_1, \ldots, X_K)$.

Let Y be a set of pairs $(\varphi, \alpha)$ where $\varphi$ is a pattern in
$\mathbb{L}$ and $\alpha$ is a real. Given an extended mining query
$(q, f) \in \mathbb{Q} \times \mathbb{F}$ and a data set $\Delta$,
$\{X_1, \ldots, X_K, Y\}$ is an extended condensed representation of
$ans(q, f)$, denoted by $X_1, \ldots, X_K, Y \models_e ans(q, f)$, if:

- $(X_1 \cup \ldots \cup X_K \cup \pi_{\mathbb{L}}(Y)) \subseteq \pi_{\mathbb{L}}(ans(q, f))$, and

- there exists a function F independent from $\Delta$ such that:
  $ans(q, f) = F(X_1, \ldots, X_K, Y)$.

Given a simple mining query q and a measure function f, we now consider
condensed representations of $sol(q)$ and extended condensed representations of
$ans(q, f)$.

3.2 Maximal Patterns

In this paper, we consider only simple mining queries that are defined by con- 
junction of anti-monotonic and monotonic selection predicates. In this case, the 
answer of a simple mining query can be represented by its most specific and 
most general patterns [9,12]. 

Definition 6. Let $q = m \wedge a$ be a simple mining query with
$m \in \mathbb{M}$ and $a \in \mathbb{A}$.

- The set of most specific patterns in $sol(q)$, denoted by $S(q)$, is defined by:

  $S(q) = min_{\preceq}(sol(q)) = \{\varphi \in sol(q) \mid (\nexists \varphi' \in sol(q))(\varphi' \prec \varphi)\}$.

- The set of most general patterns in $sol(q)$, denoted by $G(q)$, is defined by:

  $G(q) = max_{\preceq}(sol(q)) = \{\varphi \in sol(q) \mid (\nexists \varphi' \in sol(q))(\varphi \prec \varphi')\}$.

The following lemma, whose easy proof is omitted, shows that $sol(q)$ can be
computed from $S(q)$ and $G(q)$.

Lemma 1. Let $q = m \wedge a$ be a simple mining query with $m \in \mathbb{M}$
and $a \in \mathbb{A}$. We have:
$sol(q) = \{\varphi \in \mathbb{L} \mid (\exists \varphi_s \in S(q))(\exists \varphi_g \in G(q))(\varphi_s \preceq \varphi \preceq \varphi_g)\}$.

Therefore, we have the following proposition.

Proposition 1. Let $q = m \wedge a$ be a simple mining query with
$m \in \mathbb{M}$ and $a \in \mathbb{A}$. The set $\{G(q), S(q)\}$ is a
condensed representation of $sol(q)$, i.e.,

    $G(q), S(q) \models sol(q)$.

Proof: Let F be the function defined by:
$F(X_1, X_2) = \{\varphi \in \mathbb{L} \mid (\exists \varphi_1 \in X_1)(\exists \varphi_2 \in X_2)(\varphi_1 \preceq \varphi \preceq \varphi_2)\}$.
Using Lemma 1, we have $sol(q) = F(S(q), G(q))$. Moreover, F is independent from
the data set $\Delta$ since $\preceq$ does not depend on $\Delta$. Finally, we
have $S(q) \cup G(q) \subseteq sol(q)$, which completes the proof. □
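
For itemsets ordered by inclusion, the function F of this proof can be sketched as below. This is a minimal illustration under our own assumptions (itemsets encoded as strings, the hypothetical name `reconstruct_sol`, and exhaustive enumeration of the lattice); it is not an algorithm given in the paper. Recall that, in the running example, "more specific" means "superset", so a pattern lies between S and G when it is contained in some element of S and contains some element of G.

```python
from itertools import combinations

def reconstruct_sol(S, G, items="ABCDE"):
    """F(S, G): all patterns contained in some element of S and containing some element of G."""
    S = [frozenset(s) for s in S]
    G = [frozenset(g) for g in G]
    result = []
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            p = frozenset(combo)
            if any(p <= s for s in S) and any(g <= p for g in G):
                result.append("".join(combo))
    return result

# Borders of q1 and q2 in the running example.
sol_q1 = reconstruct_sol(S=["ABCE"], G=["B"])
sol_q2 = reconstruct_sol(S=["ABC", "ACD"], G=["A"])
```

No transaction is read here: the answer is rebuilt from the two borders alone, which is what independence from the data set means in Definition 5.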

We point out that algorithms for computing $S(m \wedge a)$ and $G(m \wedge a)$
directly have been proposed recently, e.g. the level-wise version space
algorithm in [6].

Example 1. Let $q_1$ and $q_2$ be the simple mining queries as given in our
Running Example 1. We recall that: $G(q_1) = \{B\}$, $S(q_1) = \{ABCE\}$,
$G(q_2) = \{A\}$, and $S(q_2) = \{ABC, ACD\}$. Applying Proposition 1, we have:
$G(q_1), S(q_1) \models sol(q_1)$ and $G(q_2), S(q_2) \models sol(q_2)$. □

It is important to note that $G(m) \models sol(m)$, $S(a) \models sol(a)$ and
$sol(m \wedge a) = sol(m) \cap sol(a)$. Therefore, $sol(q)$ can be computed from
$G(m)$ and $S(a)$. However, the set $\{G(m), S(a)\}$ is not always a condensed
representation of $sol(q)$, since we can have
$sol(q) \subset (G(m) \cup S(a))$. This is in particular the case for a query
$q = m \wedge a$ such that $sol(q) = \emptyset$, $sol(m) \neq \emptyset$, and
$sol(a) \neq \emptyset$.

On the other hand, in [12], the authors consider what they call the positive 
and the negative borders of the answer to a mining query. In our approach, 
given a simple mining query q, the corresponding positive and negative borders, 
respectively denoted by $Bd^+(q)$ and $Bd^-(q)$, can be defined as follows:

- $Bd^+(q) = \{S(q), G(q)\}$, where $S(q)$ and $G(q)$ have been defined
  previously.

- $Bd^-(q) = \{S^-(q), G^-(q)\}$, where $S^-(q)$ and $G^-(q)$ are the following
  sets:
  $S^-(q) = max_{\preceq}\{\varphi \in sol(m) \mid \varphi \notin sol(a)\}$ and
  $G^-(q) = min_{\preceq}\{\varphi \in sol(a) \mid \varphi \notin sol(m)\}$.

Therefore, according to Definition 5, the positive border can be seen as a
condensed representation of $sol(q)$, whereas the negative border cannot. Indeed,
although the sets $S^-(q)$ and $G^-(q)$ allow us to recompute $sol(q)$ without
any access to the data set, the first point of Definition 5 is not satisfied,
since neither $S^-(q)$ nor $G^-(q)$ is a subset of $sol(q)$.

We shall not consider the case of negative borders in the rest of the paper, but
we note in this respect that (i) storing $Bd^-(q)$ is not optimal in general
(since its cardinality can be much greater than that of $sol(q)$), and
(ii) $Bd^-(q)$ can be seen in our approach as a condensed representation of the
set $sol(q) \cup Bd^-(q)$.

3.3 Closed and Key Patterns 

In this section, we give alternative definitions of the notions of closed and key 
patterns introduced in [3,4,16]. To this end, given a measure function /, we 
consider the partial ordering $\preceq_f$ defined for every pair of patterns
$(\varphi, \varphi')$ by:

    $\varphi \preceq_f \varphi'$ if $\varphi \preceq \varphi'$ and $f(\varphi) = f(\varphi')$.

Definition 7. Let q be a mining query in $\mathbb{Q}$ and f be a measure
function in $\mathbb{F}$. Let $\Delta$ be a data set and $\varphi$ be a pattern
in $\mathbb{L}$.

- The set of all interesting closed patterns in $\Delta$ with respect to q and
  f, denoted by $SC(q, f)$, is defined by:

  $SC(q, f) = min_{\preceq_f}(sol(q))$.

- The set of all interesting key patterns in $\Delta$ with respect to q and f,
  denoted by $GK(q, f)$, is defined by:

  $GK(q, f) = max_{\preceq_f}(sol(q))$.

It can be shown that our notions of interesting closed patterns and interesting
key patterns coincide with those of [3,4,16] in the context of classical
association rules mining [1].

Moreover, it is easily seen that for every extended mining query (q, f) with
$q \in \mathbb{Q}$ and $f \in \mathbb{I}$, we have $S(q) \subseteq SC(q, f)$ and
$G(q) \subseteq GK(q, f)$. More precisely, the following lemma holds.

Lemma 2. Let q be a selection predicate in $\mathbb{Q}$ and f be a monotonic
increasing measure function in $\mathbb{I}$. We have:

    $S(q) = min_{\preceq}(SC(q, f))$ and $G(q) = max_{\preceq}(GK(q, f))$.

Proof: We first show that $S(q) \subseteq min_{\preceq}(SC(q, f))$. Let
$\varphi \in S(q)$. There does not exist a pattern $\varphi' \in sol(q)$ such
that $\varphi' \prec \varphi$. Therefore, there does not exist a pattern
$\varphi' \in sol(q)$ such that $\varphi' \prec \varphi$ and
$f(\varphi') = f(\varphi)$, which shows that $\varphi \in SC(q, f)$. Assume now
that $\varphi \notin min_{\preceq}(SC(q, f))$. Then, there exists
$\varphi' \in SC(q, f)$ such that $\varphi' \prec \varphi$, which is in
contradiction with the hypothesis $\varphi \in S(q)$. Hence, we have
$S(q) \subseteq min_{\preceq}(SC(q, f))$.

Now, we show that $min_{\preceq}(SC(q, f)) \subseteq S(q)$. Let
$\varphi \in min_{\preceq}(SC(q, f))$. Assume that $\varphi \notin S(q)$. Then,
there exists $\varphi' \in S(q)$ such that $\varphi' \prec \varphi$. Since it
has been shown above that $S(q) \subseteq min_{\preceq}(SC(q, f))$, we have that
$\varphi' \in SC(q, f)$. This is in contradiction with the hypothesis
$\varphi \in min_{\preceq}(SC(q, f))$. Hence, we have
$min_{\preceq}(SC(q, f)) \subseteq S(q)$.

Thus the proof that $S(q) = min_{\preceq}(SC(q, f))$ is complete. In the same
way, it can be shown that $G(q) = max_{\preceq}(GK(q, f))$, which completes the
proof. □

The following lemma states that given any pattern $\varphi$ in $sol(q)$,
$f(\varphi)$ can be computed based on $SC(q, f)$ or $GK(q, f)$.

Lemma 3. Let q be a selection predicate in $\mathbb{Q}$ and f be a monotonic
increasing measure function in $\mathbb{I}$. For every interesting pattern
$\varphi$ in $sol(q)$, we have:

- $f(\varphi) = max\{f(\varphi') \mid \varphi' \in SC(q, f)$ and $\varphi' \preceq \varphi\}$, and

- $f(\varphi) = min\{f(\varphi') \mid \varphi' \in GK(q, f)$ and $\varphi \preceq \varphi'\}$

where min and max denote respectively the minimum and maximum functions
according to the standard ordering of real numbers.

Proof: Let $\varphi \in sol(q)$ and
$X(\varphi) = \{\varphi' \in sol(q) \mid \varphi' \preceq \varphi$ and $f(\varphi') = f(\varphi)\}$.
Since $\varphi \in X(\varphi)$, we know that
$Y(\varphi) = min_{\preceq}(X(\varphi))$ is not empty. Given any
$\varphi'' \in Y(\varphi)$, assume that $\varphi'' \notin SC(q, f)$. Then, there
exists $\varphi' \in sol(q)$ such that $\varphi' \prec \varphi''$ and
$f(\varphi') = f(\varphi'')$, which shows that $\varphi' \in X(\varphi)$ and
contradicts the fact that $\varphi''$ is minimal in $X(\varphi)$. Hence, there
exists $\varphi_c \in SC(q, f)$ such that $\varphi_c \preceq \varphi$ and
$f(\varphi_c) = f(\varphi)$.

On the other hand, for every $\varphi' \in SC(q, f)$ such that
$\varphi' \preceq \varphi$, we have $f(\varphi') \leq f(\varphi)$. Therefore, we
have $f(\varphi) = max\{f(\varphi') \mid \varphi' \in SC(q, f)$ and $\varphi' \preceq \varphi\}$.
Since the fact that
$f(\varphi) = min\{f(\varphi') \mid \varphi' \in GK(q, f)$ and $\varphi \preceq \varphi'\}$
can be shown in the same way, the proof is complete. □

Let (q, f) be an extended mining query. In the following, we denote by
$SC^*(q, f)$ and $GK^*(q, f)$ the sets defined by:

- $SC^*(q, f) = \{(\varphi, f(\varphi)) \mid \varphi \in SC(q, f)\}$, and

- $GK^*(q, f) = \{(\varphi, f(\varphi)) \mid \varphi \in GK(q, f)\}$.

The following proposition follows from the previous two lemmas.

Proposition 2. Let $q = m \wedge a$ be a simple mining query with
$m \in \mathbb{M}$, $a \in \mathbb{A}$, and let f be a monotonic increasing
measure function in $\mathbb{I}$. The sets $\{S(q), G(q), SC^*(q, f)\}$ and
$\{S(q), G(q), GK^*(q, f)\}$ are extended condensed representations of
$ans(q, f)$, i.e.,

    $S(q), G(q), SC^*(q, f) \models_e ans(q, f)$ and $S(q), G(q), GK^*(q, f) \models_e ans(q, f)$.

Proof: Let F be the function defined by:
$F(X_1, X_2, Y) = \{(\varphi, \alpha) \in \mathbb{L} \times \Re \mid (\exists \varphi_1 \in X_1)(\exists \varphi_2 \in X_2)(\varphi_1 \preceq \varphi \preceq \varphi_2)$
and $\alpha = max\{\alpha' \mid (\exists \varphi' \in \mathbb{L})((\varphi', \alpha') \in Y \wedge \varphi' \preceq \varphi)\}\}$.
Using Lemma 1 and Lemma 3, we have $ans(q, f) = F(S(q), G(q), SC^*(q, f))$.
Moreover, F is independent from the data set $\Delta$ since $\preceq$ does not
depend on $\Delta$, and $S(q) \cup G(q) \cup SC(q, f) \subseteq sol(q)$.
Therefore, $\{S(q), G(q), SC^*(q, f)\}$ is an extended condensed representation
of $ans(q, f)$. Since the fact that $S(q), G(q), GK^*(q, f) \models_e ans(q, f)$
can be shown in the same way, the proof is complete. □

Example 2. Let $q_1$ be the simple mining query as defined in our Running
Example 1. We can see that:

    $GK^*(q_1, sup) = \{(B, 0.5), (BC, 0.4)\}$ and
    $SC^*(q_1, sup) = \{(ABE, 0.5), (ABCE, 0.4)\}$.

Recalling that $G(q_1) = \{B\}$ and $S(q_1) = \{ABCE\}$, and using
Proposition 2, we obtain that $S(q_1), G(q_1), SC^*(q_1, sup) \models_e ans(q_1, sup)$
and that $S(q_1), G(q_1), GK^*(q_1, sup) \models_e ans(q_1, sup)$. □
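
The closed and key patterns of this example, and the recovery of supports stated in Lemma 3, can be checked with a few lines of code. The sketch below is an illustration under our own assumptions: the answer of $(q_1, sup)$ is hard-coded from the running example, itemsets are strings ordered by inclusion, and all function names are hypothetical.

```python
# Answer of (q1, sup) in the running example: pattern -> support.
ANS_Q1 = {"B": 0.5, "AB": 0.5, "BC": 0.4, "BE": 0.5,
          "ABC": 0.4, "ABE": 0.5, "BCE": 0.4, "ABCE": 0.4}

def closed_patterns(ans):
    """Patterns with no proper superset in the answer having the same measure."""
    return {p: f for p, f in ans.items()
            if not any(set(p) < set(p2) and f == f2 for p2, f2 in ans.items())}

def key_patterns(ans):
    """Patterns with no proper subset in the answer having the same measure."""
    return {p: f for p, f in ans.items()
            if not any(set(p2) < set(p) and f == f2 for p2, f2 in ans.items())}

def measure_from_closed(p, closed):
    """Lemma 3, first item: maximum over the closed patterns that contain p."""
    return max(f for c, f in closed.items() if set(p) <= set(c))

def measure_from_keys(p, keys):
    """Lemma 3, second item: minimum over the key patterns contained in p."""
    return min(f for k, f in keys.items() if set(k) <= set(p))

closed = closed_patterns(ANS_Q1)
keys = key_patterns(ANS_Q1)
```

Only two pairs need to be stored in either case, and every support in the answer can be recovered from them.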

4 Condensed Representations of Sets of Mining Queries 

In this section, we extend the notions of condensed representation and of ex- 
tended condensed representation to the case of sets of mining queries. 

4.1 Definitions 

Definition 8 - Set of Mining Queries. Let $Q = \{q_1, \ldots, q_n\}$ be a set of
mining queries. Given a data set $\Delta$, the answer of Q in $\Delta$, denoted
by $sol(Q)$, is the set defined by:

    $sol(Q) = \bigcup_{q \in Q} \{(\varphi, q) \mid \varphi \in sol(q)\}$.

Let f be a measure function in $\mathbb{F}$. The answer of (Q, f) in $\Delta$,
denoted by $ans(Q, f)$, is the set defined by:

    $ans(Q, f) = \bigcup_{q \in Q} \{(\varphi, q, f(\varphi)) \mid \varphi \in sol(q)\}$.

Definition 9 - Condensed Representation. Let $X_1, \ldots, X_K$ be sets of pairs
$(\varphi, q)$ where $\varphi \in \mathbb{L}$ and $q \in \mathbb{Q}$. Given a
set of mining queries Q and a data set $\Delta$, $\{X_1, \ldots, X_K\}$ is a
condensed representation of $sol(Q)$, denoted by
$X_1, \ldots, X_K \models sol(Q)$, if:

- $\pi_{\mathbb{L}}(X_1) \cup \ldots \cup \pi_{\mathbb{L}}(X_K) \subseteq \pi_{\mathbb{L}}(sol(Q))$, and

- there exists a function F independent from $\Delta$ such that:
  $sol(Q) = F(X_1, \ldots, X_K)$.

Let Y be a set of pairs $(\varphi, \alpha)$ where $\varphi$ is a pattern in
$\mathbb{L}$ and $\alpha$ is a real. Given a set of mining queries Q, a measure
function f and a data set $\Delta$, $\{X_1, \ldots, X_K, Y\}$ is an extended
condensed representation of $ans(Q, f)$, denoted by
$X_1, \ldots, X_K, Y \models_e ans(Q, f)$, if:

- $\pi_{\mathbb{L}}(X_1) \cup \ldots \cup \pi_{\mathbb{L}}(X_K) \cup \pi_{\mathbb{L}}(Y) \subseteq \pi_{\mathbb{L}}(ans(Q, f))$, and

- there exists a function F independent from $\Delta$ such that:
  $ans(Q, f) = F(X_1, \ldots, X_K, Y)$.

Let $C = \{Z_1, \ldots, Z_K\}$ and $C' = \{Z'_1, \ldots, Z'_K\}$ be two
condensed representations (extended or not) having the same cardinality K. We
say that C is more concise than C' if there exists a permutation $\theta$ of
$\{1, \ldots, K\}$ such that for every $i = 1, \ldots, K$,
$Z_i \subseteq Z'_{\theta(i)}$.

Given a set of mining queries Q and a measure function f, we study condensed
representations of $sol(Q)$ and extended condensed representations of
$ans(Q, f)$.

4.2 Maximal Patterns 

Given a set of mining queries $Q = \{q_1, \ldots, q_n\}$, it is well known [9]
that, although $\{S(q_i), G(q_i)\}$ is a condensed representation of
$sol(q_i)$, for every $i = 1, \ldots, n$, the set
$\{S(q_1) \cup \ldots \cup S(q_n), G(q_1) \cup \ldots \cup G(q_n)\}$ is not a
condensed representation of $sol(q_1) \cup \ldots \cup sol(q_n)$.

However, if for every $\varphi$ in $S(q_1) \cup \ldots \cup S(q_n)$ or in
$G(q_1) \cup \ldots \cup G(q_n)$, we keep track of the query $q_i$ the pattern
$\varphi$ comes from, then $sol(Q)$ and $ans(Q, f)$ can be condensed. For this
reason, we define the sets $\mathcal{S}(Q)$ and $\mathcal{G}(Q)$ as follows:

Definition 10. Let $Q = \{q_1, \ldots, q_n\}$ be a set of mining queries
$q_i \in \mathbb{Q}$ ($i = 1, \ldots, n$). The sets $\mathcal{S}(Q)$ and
$\mathcal{G}(Q)$ are defined by:

    $\mathcal{S}(Q) = \bigcup_{q \in Q} \{(\varphi, q) \mid \varphi \in S(q)\}$ and
    $\mathcal{G}(Q) = \bigcup_{q \in Q} \{(\varphi, q) \mid \varphi \in G(q)\}$.

Given these definitions, we have the following proposition. 

Proposition 3. Let $Q = \{q_1, \ldots, q_n\}$ be a set of mining queries
$q_i = m_i \wedge a_i$ with $m_i \in \mathbb{M}$ and $a_i \in \mathbb{A}$
($i = 1, \ldots, n$). The set $\{\mathcal{S}(Q), \mathcal{G}(Q)\}$ is a
condensed representation of $sol(Q)$, i.e.,
$\mathcal{S}(Q), \mathcal{G}(Q) \models sol(Q)$.

Proof: Let F be the function defined by:
$F(X_1, X_2) = \{(\varphi, q) \in \mathbb{L} \times Q \mid (\exists (\varphi_1, q_1) \in X_1)(\exists (\varphi_2, q_2) \in X_2)(q_1 = q_2 = q$
and $\varphi_1 \preceq \varphi \preceq \varphi_2)\}$. Based on Lemma 1, we can
easily see that $sol(Q) = F(\mathcal{S}(Q), \mathcal{G}(Q))$. Moreover, F is
independent from the data set $\Delta$ since $\preceq$ does not depend on
$\Delta$. Finally, we have $\mathcal{S}(Q) \subseteq sol(Q)$ and
$\mathcal{G}(Q) \subseteq sol(Q)$, which completes the proof. □

Example 3. In the context of our Running Example 1, let $q_3 = m_3 \wedge a_3$
and $q_4 = m_4 \wedge a_4$ where $m_3$, $m_4$, $a_3$ and $a_4$ are selection
predicates defined for every pattern $\varphi \in \mathbb{L}$ by:

- $m_3(\varphi, \Delta)$ = true if $A \subseteq \varphi$, and
  $m_4(\varphi, \Delta)$ = true if $AC \subseteq \varphi$,

- $a_3(\varphi, \Delta)$ = true if $sup(\varphi, \Delta) \geq 0.4$ and
  $\varphi \subseteq ABC$, and $a_4(\varphi, \Delta)$ = true if
  $sup(\varphi, \Delta) \geq 0.3$ and $\varphi \subseteq ABCD$.

We note that $m_3$ and $m_4$ are monotonic selection predicates such that
$m_4 \sqsubseteq m_3$, whereas $a_3$ and $a_4$ are anti-monotonic selection
predicates such that $a_3 \sqsubseteq a_4$. We can see that $S(q_3) = \{ABC\}$,
$S(q_4) = \{ABC, ACD\}$, $G(q_3) = \{A\}$ and $G(q_4) = \{AC\}$. Considering
$Q = \{q_3, q_4\}$, we have:

    $\mathcal{S}(Q) = \{(ABC, q_3), (ABC, q_4), (ACD, q_4)\}$ and $\mathcal{G}(Q) = \{(A, q_3), (AC, q_4)\}$

Using Proposition 3, we can see that:
$\mathcal{S}(Q), \mathcal{G}(Q) \models sol(Q)$. □
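
The function F used in the proof of Proposition 3 can be sketched for this example as follows. This is only an illustration under our own assumptions (itemsets as strings, queries identified by the labels "q3" and "q4", the hypothetical name `reconstruct_sol_set`, and exhaustive enumeration of the lattice), not an algorithm from the paper.

```python
from itertools import combinations

# Tagged borders of Example 3: pairs (pattern, query label).
S_Q = {("ABC", "q3"), ("ABC", "q4"), ("ACD", "q4")}
G_Q = {("A", "q3"), ("AC", "q4")}

def reconstruct_sol_set(S_Q, G_Q, items="ABCDE"):
    """Pairs (p, q) such that, for the same query q, p is contained in a most
    specific pattern of q and contains a most general pattern of q."""
    queries = {q for _, q in S_Q | G_Q}
    result = set()
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            p = frozenset(combo)
            for q in queries:
                below = any(q1 == q and p <= frozenset(p1) for p1, q1 in S_Q)
                above = any(q2 == q and frozenset(p2) <= p for p2, q2 in G_Q)
                if below and above:
                    result.add(("".join(combo), q))
    return result

sol_Q = reconstruct_sol_set(S_Q, G_Q)
```

Keeping the query label with each border element is what prevents, for instance, ACD from being wrongly attributed to $q_3$, which would happen if the borders of the two queries were simply merged.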

In what follows, we show how to define condensed representations of $sol(Q)$
that are more concise than $\{\mathcal{S}(Q), \mathcal{G}(Q)\}$.

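Let $Q = \{q_1, \ldots, q_n\}$ be a set of mining queries $q_i = m_i \wedge a_i$ with $m_i \in \mathcal{M}$
and $a_i \in \mathcal{A}$ ($i = 1, \ldots, n$). We define two partial pre-orderings, denoted by $\leq_A$
and $\leq_M$, as follows: for all $(\varphi_i, q_i)$ and $(\varphi_j, q_j)$ in $\mathcal{L} \times Q$:

$(\varphi_i, q_i) \leq_A (\varphi_j, q_j)$ if $\varphi_i \preceq \varphi_j$ and $a_i \sqsubseteq a_j$
$(\varphi_i, q_i) \leq_M (\varphi_j, q_j)$ if $\varphi_i \preceq \varphi_j$ and $m_j \sqsubseteq m_i$.

Then, we denote by $\Sigma(Q)$ the set of all minimal pairs in $\mathcal{S}(Q)$ with respect
to $\leq_A$. Similarly, we denote by $\Gamma(Q)$ the set of all maximal pairs in $\mathcal{G}(Q)$ with
respect to $\leq_M$. That is:

$$\Sigma(Q) = \min_{\leq_A}(\mathcal{S}(Q)) \quad \text{and} \quad \Gamma(Q) = \max_{\leq_M}(\mathcal{G}(Q)).$$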

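The following lemma states that, for every $q \in Q$, $sol(q)$ can be computed
based on $\Sigma(Q)$ and $\Gamma(Q)$ only.

Lemma 4. Let $Q = \{q_1, \ldots, q_n\}$ be a set of mining queries $q_i = m_i \wedge a_i$ with
$m_i \in \mathcal{M}$ and $a_i \in \mathcal{A}$ ($i = 1, \ldots, n$). For every $q$ in $Q$, we have:

$$sol(q) = \{\varphi \in \mathcal{L} \mid (\exists (\varphi_i, q_i) \in \Sigma(Q))((\varphi_i, q_i) \leq_A (\varphi, q)) \text{ and } (\exists (\varphi_j, q_j) \in \Gamma(Q))((\varphi, q) \leq_M (\varphi_j, q_j))\}.$$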

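Proof: Let $X(q)$ be the set defined by:

$$X(q) = \{\varphi \in \mathcal{L} \mid (\exists (\varphi_i, q_i) \in \Sigma(Q))((\varphi_i, q_i) \leq_A (\varphi, q)) \text{ and } (\exists (\varphi_j, q_j) \in \Gamma(Q))((\varphi, q) \leq_M (\varphi_j, q_j))\}.$$

We first show that $X(q) \subseteq sol(q)$. Let $\varphi \in X(q)$. There exist $(\varphi_i, q_i) \in \Sigma(Q)$
and $(\varphi_j, q_j) \in \Gamma(Q)$ such that $(\varphi_i, q_i) \leq_A (\varphi, q)$ and $(\varphi, q) \leq_M (\varphi_j, q_j)$.

On one hand, we know that $q_i(\varphi_i) = true$. Thus, we have $a_i(\varphi_i) = true$. It
follows that $a(\varphi_i) = true$ since $a_i \sqsubseteq a$, and that $a(\varphi) = true$ since $\varphi_i \preceq \varphi$ and $a$
is anti-monotonic.

On the other hand, we know that $q_j(\varphi_j) = true$. Thus, we have $m_j(\varphi_j) = true$.
It follows that $m(\varphi_j) = true$ since $m_j \sqsubseteq m$, and that $m(\varphi) = true$ since
$\varphi \preceq \varphi_j$ and $m$ is monotonic. Therefore, we have $a(\varphi) = true$ and $m(\varphi) = true$,
which shows that $\varphi \in sol(q)$. Hence, we have: $X(q) \subseteq sol(q)$.

Now, we show that $sol(q) \subseteq X(q)$. Let $\varphi \in sol(q)$. There exist $\varphi_s \in S(q)$
and $\varphi_g \in G(q)$ such that $\varphi_s \preceq \varphi \preceq \varphi_g$. Thus, we have $(\varphi_s, q) \in \mathcal{S}(Q)$ and
$(\varphi_g, q) \in \mathcal{G}(Q)$.

Given the definitions of $\Sigma(Q)$ and $\Gamma(Q)$, there exist $(\varphi_i, q_i) \in \Sigma(Q)$ and
$(\varphi_j, q_j) \in \Gamma(Q)$ such that $(\varphi_i, q_i) \leq_A (\varphi_s, q)$ and $(\varphi_g, q) \leq_M (\varphi_j, q_j)$. Moreover,
we have $(\varphi_s, q) \leq_A (\varphi, q)$ since $\varphi_s \preceq \varphi$, and $(\varphi, q) \leq_M (\varphi_g, q)$ since $\varphi \preceq \varphi_g$.

Thus, $(\varphi_i, q_i) \leq_A (\varphi, q)$ and $(\varphi, q) \leq_M (\varphi_j, q_j)$, which shows that $\varphi \in X(q)$.

Hence, we have $sol(q) \subseteq X(q)$, which completes the proof. □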

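As a consequence of Lemma 4 above, we have the following theorem:

Theorem 1. Let $Q = \{q_1, \ldots, q_n\}$ be a set of mining queries $q_i = m_i \wedge a_i$ with
$m_i \in \mathcal{M}$ and $a_i \in \mathcal{A}$ ($i = 1, \ldots, n$). The set $\{\Sigma(Q), \Gamma(Q)\}$ is a condensed
representation of $sol(Q)$, i.e., $\Sigma(Q), \Gamma(Q) \models sol(Q)$.

Moreover, $\{\Sigma(Q), \Gamma(Q)\}$ is more concise than $\{\mathcal{S}(Q), \mathcal{G}(Q)\}$.

Proof: Let $F$ be the function defined by:

$$F(X_1, X_2) = \{(\varphi, q) \in \mathcal{L} \times Q \mid (\exists (\varphi_1, q_1) \in X_1)((\varphi_1, q_1) \leq_A (\varphi, q)) \text{ and } (\exists (\varphi_2, q_2) \in X_2)((\varphi, q) \leq_M (\varphi_2, q_2))\}.$$

Using Lemma 4, we can easily see that $sol(Q) = F(\Sigma(Q), \Gamma(Q))$. Moreover, $F$
is independent from the data set $\Delta$ since $\preceq$ and $\sqsubseteq$ do not depend on $\Delta$.

It is easily seen that we have $\Sigma(Q) \subseteq \mathcal{S}(Q) \subseteq sol(Q)$ and $\Gamma(Q) \subseteq \mathcal{G}(Q) \subseteq sol(Q)$.
Therefore, $\{\Sigma(Q), \Gamma(Q)\}$ is more concise than $\{\mathcal{S}(Q), \mathcal{G}(Q)\}$ and thus,
the proof is complete. □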

Example 4. We recall from Example 3 that we have:

$$\mathcal{S}(Q) = \{(ABC, q_3), (ABC, q_4), (ACD, q_4)\} \quad \text{and} \quad \mathcal{G}(Q) = \{(A, q_3), (AC, q_4)\}.$$

Since $a_3 \sqsubseteq a_4$, we have $(ABC, q_3) \leq_A (ABC, q_4)$. On the other hand, $(A, q_3)$ and
$(AC, q_4)$ are not comparable with respect to $\leq_M$. It follows that:

$$\Sigma(Q) = \{(ABC, q_3), (ACD, q_4)\} \quad \text{and} \quad \Gamma(Q) = \{(A, q_3), (AC, q_4)\}.$$

Using Theorem 1, we can see that $\Sigma(Q), \Gamma(Q) \models sol(Q)$. Moreover, since $\Sigma(Q) \subset \mathcal{S}(Q)$
and $\Gamma(Q) \subseteq \mathcal{G}(Q)$, $\{\Sigma(Q), \Gamma(Q)\}$ is more concise than $\{\mathcal{S}(Q), \mathcal{G}(Q)\}$. □
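The computation of the minimal and maximal pairs in Example 4 can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the ordering between selection predicates is assumed to be supplied explicitly as a set of query-name pairs (`a_leq`, `m_leq`), patterns are frozensets ordered by "φ1 precedes φ2 when φ2 ⊆ φ1", and "minimal" is read as "no other pair lies strictly below", which is one reasonable reading for a pre-ordering.

```python
def precedes(p1, p2):
    """Specialization order on itemsets: p1 precedes p2 when p2 is a subset of p1."""
    return p2 <= p1

def make_leq_A(a_leq):
    """(p_i, q_i) <=_A (p_j, q_j): p_i precedes p_j and a_i is below a_j."""
    def leq(x, y):
        (pi, qi), (pj, qj) = x, y
        return precedes(pi, pj) and (qi == qj or (qi, qj) in a_leq)
    return leq

def make_leq_M(m_leq):
    """(p_i, q_i) <=_M (p_j, q_j): p_i precedes p_j and m_j is below m_i."""
    def leq(x, y):
        (pi, qi), (pj, qj) = x, y
        return precedes(pi, pj) and (qi == qj or (qj, qi) in m_leq)
    return leq

def minimal(pairs, leq):
    """Pairs with no other pair strictly below them."""
    return {x for x in pairs
            if not any(leq(y, x) and not leq(x, y) for y in pairs)}

def maximal(pairs, leq):
    """Pairs with no other pair strictly above them."""
    return {x for x in pairs
            if not any(leq(x, y) and not leq(y, x) for y in pairs)}

def pair(pattern, query):
    return (frozenset(pattern), query)

# Predicate orderings stated in Example 3 (assumed encoding by query name).
a_leq = {("q3", "q4")}  # a_3 is below a_4
m_leq = {("q4", "q3")}  # m_4 is below m_3

S_Q = {pair("ABC", "q3"), pair("ABC", "q4"), pair("ACD", "q4")}
G_Q = {pair("A", "q3"), pair("AC", "q4")}

sigma = minimal(S_Q, make_leq_A(a_leq))
gamma = maximal(G_Q, make_leq_M(m_leq))
```

On this input, the pair (ABC, q4) is dropped from the specific side because (ABC, q3) lies below it, while the general side is unchanged.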

We end this subsection by showing how to optimize the computation of $\Sigma(Q)$
(respectively $\Gamma(Q)$) by stating that two pairs $(\varphi_i, q_i)$ and $(\varphi_j, q_j)$ in $\mathcal{S}(Q)$
(respectively $\mathcal{G}(Q)$) cannot be comparable with respect to $\leq_A$ (respectively $\leq_M$) if
$\varphi_i \neq \varphi_j$.

Indeed, based on this result, it turns out that the computation of $\Sigma(Q) =
\min_{\leq_A}(\mathcal{S}(Q))$ (respectively $\Gamma(Q) = \max_{\leq_M}(\mathcal{G}(Q))$) only requires comparing the
pairs of $\mathcal{S}(Q)$ (respectively $\mathcal{G}(Q)$) that contain the same pattern.

Proposition 4. Let $Q = \{q_1, \ldots, q_n\}$ be a set of mining queries $q_i = m_i \wedge a_i$
with $m_i \in \mathcal{M}$ and $a_i \in \mathcal{A}$ ($i = 1, \ldots, n$).

If $(\varphi_i, q_i)$ and $(\varphi_j, q_j)$ are two pairs in $\mathcal{S}(Q)$ (respectively $\mathcal{G}(Q)$) such that
$(\varphi_i, q_i) \leq_A (\varphi_j, q_j)$ (respectively such that $(\varphi_i, q_i) \leq_M (\varphi_j, q_j)$), then we have
$\varphi_i = \varphi_j$.

Proof: Let $(\varphi_i, q_i)$ and $(\varphi_j, q_j)$ be two pairs in $\mathcal{S}(Q)$ such that $(\varphi_i, q_i) \leq_A
(\varphi_j, q_j)$. Since $q_i(\varphi_i) = true$, we have $a_i(\varphi_i) = true$ and $a_j(\varphi_i) = true$ since
$a_i \sqsubseteq a_j$. On the other hand, since $q_j(\varphi_j) = true$, $\varphi_i \preceq \varphi_j$ and $m_j$ is monotonic,
we have $m_j(\varphi_j) = true$ and $m_j(\varphi_i) = true$. Therefore, we have $q_j(\varphi_i) = true$,
meaning that $\varphi_i \in sol(q_j)$. Moreover, since $\varphi_j$ is minimal in $sol(q_j)$ with respect
to $\preceq$ and $\varphi_i \preceq \varphi_j$, we necessarily have $\varphi_i = \varphi_j$. It can be shown in the same way
that if $(\varphi_i, q_i)$ and $(\varphi_j, q_j)$ are two pairs in $\mathcal{G}(Q)$ such that $(\varphi_i, q_i) \leq_M (\varphi_j, q_j)$,
then $\varphi_i = \varphi_j$. Thus the proof is complete. □
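The optimization that Proposition 4 licenses can be sketched in Python: since comparable pairs must share their pattern, the tagged pairs can be grouped by pattern and only the queries inside each group need to be compared. The sketch below is an illustration under assumptions: the predicate ordering is again given as a set of query-name pairs, and the function name `sigma_by_pattern` is hypothetical.

```python
from collections import defaultdict

def sigma_by_pattern(S_Q, a_leq):
    """Minimal pairs of S_Q w.r.t. <=_A, comparing only pairs with equal patterns."""
    groups = defaultdict(set)
    for pattern, query in S_Q:
        groups[pattern].add(query)

    result = set()
    for pattern, queries in groups.items():
        for q in queries:
            # q is dropped when another query of the same group has a
            # strictly smaller anti-monotonic predicate.
            dominated = any(
                other != q and (other, q) in a_leq and (q, other) not in a_leq
                for other in queries
            )
            if not dominated:
                result.add((pattern, q))
    return result

def pair(pattern, query):
    return (frozenset(pattern), query)

# Data of Example 4; a_3 is below a_4.
a_leq = {("q3", "q4")}
S_Q = {pair("ABC", "q3"), pair("ABC", "q4"), pair("ACD", "q4")}
sigma = sigma_by_pattern(S_Q, a_leq)
```

Grouping replaces a comparison over all pairs of tagged patterns with comparisons inside each group, which stays small when few queries share a boundary pattern.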

4.3 Closed and Key Patterns 

In this subsection, we consider the case of extended condensed representations
of a set $Q = \{q_1, \ldots, q_n\}$ of simple mining queries with $q_i \in \mathcal{Q}$ ($i = 1, \ldots, n$)
involving a monotonic increasing measure function $f$ in I. To this end, recalling
that $SC(q_i, f)$ is the set of all interesting closed patterns in $\Delta$ with respect to
$q_i$ and $f$ ($i = 1, \ldots, n$), we define the sets $\mathcal{SC}(Q, f)$ and $\mathcal{SC}^*(Q, f)$ as follows:

$$\mathcal{SC}(Q, f) = \min_{\leq_f}\Big(\bigcup_{q \in Q} SC(q, f)\Big) \quad \text{and} \quad \mathcal{SC}^*(Q, f) = \{(\varphi, f(\varphi)) \mid \varphi \in \mathcal{SC}(Q, f)\}.$$

Example 5. Let $Q = \{q_1, q_2\}$ be the set of simple mining queries as defined in
our Running Example 1. We have:

- $SC(q_1, f) = \{ABCE, ABE\}$ and $SC(q_2, f) = \{ABC, ACD, AB, AC, A\}$,

- $\mathcal{SC}(Q, f) = \{ABCE, ABE, ACD, AC, A\}$ and

- $\mathcal{SC}^*(Q, f) = \{(ABCE, 0.4), (ABE, 0.5), (ACD, 0.3), (AC, 0.5), (A, 0.8)\}$. □

Based on Lemma 3, we can state the following proposition:

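Proposition 5. Let $Q = \{q_1, \ldots, q_n\}$ be a set of simple mining queries with
$q_i \in \mathcal{Q}$ ($i = 1, \ldots, n$) and $f$ be a monotonic increasing measure function in I.
For every $i = 1, \ldots, n$ and $\varphi \in sol(q_i)$, we have:

$$f(\varphi) = \max\{f(\varphi') \mid \varphi' \in \mathcal{SC}(Q, f) \text{ and } \varphi' \preceq \varphi\}.$$

Proof: Let $\varphi_i$ in $sol(q_i)$. Using Lemma 3, we know that:

$$f(\varphi_i) = \max\{f(\varphi'_i) \mid \varphi'_i \in SC(q_i, f) \text{ and } \varphi'_i \preceq \varphi_i\}.$$

Let $\varphi'_i \in SC(q_i, f)$ such that $\varphi'_i \preceq \varphi_i$ and $f(\varphi'_i) = f(\varphi_i)$. Given the definition
of $\mathcal{SC}(Q, f)$, there exists $\varphi'_j \in \mathcal{SC}(Q, f)$ such that $\varphi'_j \leq_f \varphi'_i$, i.e., $\varphi'_j \preceq \varphi'_i$ and
$f(\varphi'_j) = f(\varphi'_i)$. Thus, there exists $\varphi'_j \in \mathcal{SC}(Q, f)$ such that $\varphi'_j \preceq \varphi_i$ and $f(\varphi'_j) =
f(\varphi_i)$. Finally, for every $\varphi' \in \mathcal{SC}(Q, f)$ such that $\varphi' \preceq \varphi_i$, we have $f(\varphi') \leq f(\varphi_i)$
since $f$ is a monotonic increasing function. It follows that: $f(\varphi_i) = f(\varphi'_j) =
\max\{f(\varphi') \mid \varphi' \in \mathcal{SC}(Q, f) \text{ and } \varphi' \preceq \varphi_i\}$, which completes the proof. □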

The same idea applies for key patterns. Recalling that $GK(q_i, f)$ is the set
of all interesting key patterns in $\Delta$ with respect to $q_i$ and $f$ ($i = 1, \ldots, n$), we
define the sets $\mathcal{GK}(Q, f)$ and $\mathcal{GK}^*(Q, f)$ by:

$$\mathcal{GK}(Q, f) = \max_{\leq_f}\Big(\bigcup_{q \in Q} GK(q, f)\Big) \quad \text{and} \quad \mathcal{GK}^*(Q, f) = \{(\varphi, f(\varphi)) \mid \varphi \in \mathcal{GK}(Q, f)\}.$$

The following proposition states how to compute $f(\varphi)$ based on the set
$\mathcal{GK}(Q, f)$.
Proposition 6. Let $Q = \{q_1, \ldots, q_n\}$ be a set of simple mining queries with
$q_i \in \mathcal{Q}$ ($i = 1, \ldots, n$) and $f$ be a monotonic increasing measure function in I.
For every $i = 1, \ldots, n$ and $\varphi \in sol(q_i)$, we have:

$$f(\varphi) = \min\{f(\varphi') \mid \varphi' \in \mathcal{GK}(Q, f) \text{ and } \varphi \preceq \varphi'\}.$$

Proof: The proof uses arguments similar to those of Proposition 5, and thus is
omitted. □

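Using Propositions 5 and 6 and Theorem 1, the following theorem holds.

Theorem 2. Let $Q = \{q_1, \ldots, q_n\}$ be a set of mining queries with $q_i = m_i \wedge a_i$
where $m_i \in \mathcal{M}$ and $a_i \in \mathcal{A}$ ($i = 1, \ldots, n$). Let $f$ be a monotonic increasing
measure function in I.

The sets $\{\Sigma(Q), \Gamma(Q), \mathcal{SC}^*(Q, f)\}$ and $\{\Sigma(Q), \Gamma(Q), \mathcal{GK}^*(Q, f)\}$ are
extended condensed representations of $ans(Q, f)$, i.e.,

$\Sigma(Q), \Gamma(Q), \mathcal{SC}^*(Q, f) \models_e ans(Q, f)$, and
$\Sigma(Q), \Gamma(Q), \mathcal{GK}^*(Q, f) \models_e ans(Q, f)$.

Proof: Let $F$ be the function defined as follows: for every triple $(\varphi, q, \alpha) \in
\mathcal{L} \times Q \times \mathbb{R}$, $(\varphi, q, \alpha) \in F(X_1, X_2, Y)$ if:

- there exists $(\varphi_1, q_1) \in X_1$ such that $(\varphi_1, q_1) \leq_A (\varphi, q)$, and

- there exists $(\varphi_2, q_2) \in X_2$ such that $(\varphi, q) \leq_M (\varphi_2, q_2)$, and

- $\alpha = \max\{\alpha' \mid (\exists \varphi' \in \mathcal{L})((\varphi', \alpha') \in Y \text{ and } \varphi' \preceq \varphi)\}$.

Using Theorem 1 and Proposition 5, we can easily see that $ans(Q, f) =
F(\Sigma(Q), \Gamma(Q), \mathcal{SC}^*(Q, f))$. Moreover, $F$ is independent from the data set $\Delta$
since $\preceq$ and $\sqsubseteq$ do not depend on $\Delta$. Finally, we have $\Sigma(Q) \subseteq \mathcal{S}(Q) \subseteq sol(Q)$,
$\Gamma(Q) \subseteq \mathcal{G}(Q) \subseteq sol(Q)$ and $\pi_{\mathcal{L}}(\mathcal{SC}^*(Q, f)) = \mathcal{SC}(Q, f) \subseteq \pi_{\mathcal{L}}(sol(Q))$, which
shows that $\Sigma(Q), \Gamma(Q), \mathcal{SC}^*(Q, f) \models_e ans(Q, f)$. Using Theorem 1 and Proposition 6,
it can be shown in the same way that $\Sigma(Q), \Gamma(Q), \mathcal{GK}^*(Q, f) \models_e
ans(Q, f)$; thus the proof is complete. □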

Example 6. Let $Q = \{q_1, q_2\}$ be the set of simple mining queries as defined in
our Running Example 1. We recall from Examples 1 and 5 that:

- $S(q_1) = \{ABCE\}$, $S(q_2) = \{ABC, ACD\}$, $G(q_1) = \{B\}$ and $G(q_2) = \{A\}$,

- $SC(q_1, f) = \{ABCE, ABE\}$ and $SC(q_2, f) = \{ABC, ACD, AB, AC, A\}$,

- $\mathcal{SC}(Q, f) = \{ABCE, ABE, ACD, AC, A\}$, and

- $\mathcal{SC}^*(Q, f) = \{(ABCE, 0.4), (ABE, 0.5), (ACD, 0.3), (AC, 0.5), (A, 0.8)\}$.

Therefore, $\mathcal{S}(Q) = \{(ABCE, q_1), (ABC, q_2), (ACD, q_2)\}$ and $\mathcal{G}(Q) = \{(B, q_1),
(A, q_2)\}$. Since $\mathcal{S}(Q)$ (respectively $\mathcal{G}(Q)$) contains no pairs comparable with respect
to $\leq_A$ (respectively $\leq_M$), we have $\Sigma(Q) = \mathcal{S}(Q)$ (respectively $\Gamma(Q) = \mathcal{G}(Q)$).
Then using Theorem 2, we know that $\{\Sigma(Q), \Gamma(Q), \mathcal{SC}^*(Q, f)\}$ is an
extended condensed representation of $ans(Q, f)$, i.e., $\Sigma(Q), \Gamma(Q), \mathcal{SC}^*(Q, f) \models_e
ans(Q, f)$. Moreover, we note that $\mathcal{SC}^*(Q, f) \subset SC(q_1, f) \cup SC(q_2, f)$. □

The previous example shows a case where the two condensed representations
$\{\mathcal{S}(Q), \mathcal{G}(Q)\}$ and $\{\Sigma(Q), \Gamma(Q)\}$ of $sol(Q)$ are equal. In the next subsection,
we show that these condensed representations can be made more concise under
additional hypotheses that are satisfied in the traditional case of association
rules mining [1].
4.4 Further Improvement

We assume now that every query $q \in Q$ is of the form $q = \hat{q} \wedge \bar{q}$ where $\bar{q}$
is an independent selection predicate. Intuitively, in order to further condense
$\{\Sigma(Q), \Gamma(Q)\}$, we compare queries based on their ‘non-independent parts,’ since
their ‘independent parts’ can be evaluated without considering the underlying
data set.

To this end, given a set of mining queries $Q = \{q_1, \ldots, q_n\}$ where $q_i = \hat{q}_i \wedge \bar{q}_i$
with $\hat{q}_i \in \mathcal{Q}$ and $\bar{q}_i \in \mathcal{Q}$, we define two partial pre-orderings, denoted by $\leq_{\hat{A}}$ and
$\leq_{\hat{M}}$, as follows: for all $(\varphi_i, q_i)$ and $(\varphi_j, q_j)$ in $\mathcal{L} \times Q$:

$(\varphi_i, q_i) \leq_{\hat{A}} (\varphi_j, q_j)$ if $\varphi_i \preceq \varphi_j$ and $\hat{a}_i \sqsubseteq \hat{a}_j$

$(\varphi_i, q_i) \leq_{\hat{M}} (\varphi_j, q_j)$ if $\varphi_i \preceq \varphi_j$ and $\hat{m}_j \sqsubseteq \hat{m}_i$.

Then, we introduce the following notations:

$$\hat{\Sigma}(Q) = \min_{\leq_{\hat{A}}}(\mathcal{S}(Q)) \quad \text{and} \quad \hat{\Gamma}(Q) = \max_{\leq_{\hat{M}}}(\mathcal{G}(Q)).$$

The following lemma states how, for every $q$ in $Q$, $sol(q)$ can be computed
based on $\hat{\Sigma}(Q)$ and $\hat{\Gamma}(Q)$, assuming that the independent part $\bar{q}$ of $q$ is known.

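Lemma 5. Let $Q = \{q_1, \ldots, q_n\}$ be a set of mining queries $q_i = \hat{q}_i \wedge \bar{q}_i$ where
$\hat{q}_i = \hat{m}_i \wedge \hat{a}_i$ with $\hat{m}_i \in \mathcal{M}$, $\hat{a}_i \in \mathcal{A}$, and $\bar{q}_i = \bar{m}_i \wedge \bar{a}_i$ with $\bar{m}_i \in \mathcal{M}$, $\bar{a}_i \in \mathcal{A}$
($i = 1, \ldots, n$). For every $q$ in $Q$, we have:

$$sol(q) = \{\varphi \in sol(\bar{q}) \mid (\exists (\varphi_i, q_i) \in \hat{\Sigma}(Q))((\varphi_i, q_i) \leq_{\hat{A}} (\varphi, q)) \text{ and } (\exists (\varphi_j, q_j) \in \hat{\Gamma}(Q))((\varphi, q) \leq_{\hat{M}} (\varphi_j, q_j))\}.$$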

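Proof: Let $X(q)$ be the set defined by:

$$X(q) = \{\varphi \in sol(\bar{q}) \mid (\exists (\varphi_i, q_i) \in \hat{\Sigma}(Q))((\varphi_i, q_i) \leq_{\hat{A}} (\varphi, q)) \text{ and } (\exists (\varphi_j, q_j) \in \hat{\Gamma}(Q))((\varphi, q) \leq_{\hat{M}} (\varphi_j, q_j))\}.$$

We first show that $X(q) \subseteq sol(q)$. Let $\varphi \in X(q)$. There exist $(\varphi_i, q_i) \in \hat{\Sigma}(Q)$
and $(\varphi_j, q_j) \in \hat{\Gamma}(Q)$ such that $(\varphi_i, q_i) \leq_{\hat{A}} (\varphi, q)$ and $(\varphi, q) \leq_{\hat{M}} (\varphi_j, q_j)$.

On one hand, we know that $q_i(\varphi_i) = true$. Thus, we have $\hat{a}_i(\varphi_i) = true$. It
follows that $\hat{a}(\varphi_i) = true$ since $\hat{a}_i \sqsubseteq \hat{a}$, and so, $\hat{a}(\varphi) = true$ since $\varphi_i \preceq \varphi$ and $\hat{a}$
is anti-monotonic.

On the other hand, we know that $q_j(\varphi_j) = true$. Thus, we have $\hat{m}_j(\varphi_j) = true$.
It follows that $\hat{m}(\varphi_j) = true$ since $\hat{m}_j \sqsubseteq \hat{m}$, and so, $\hat{m}(\varphi) = true$ since
$\varphi \preceq \varphi_j$ and $\hat{m}$ is monotonic. Therefore, we have $\hat{q}(\varphi) = true$. Since $\varphi \in sol(\bar{q})$,
we have $q(\varphi) = \hat{q}(\varphi) \wedge \bar{q}(\varphi) = true$, which shows that $X(q) \subseteq sol(q)$.

Now, we show that $sol(q) \subseteq X(q)$. Let $\varphi \in sol(q)$. There exist $\varphi_s \in S(q)$
and $\varphi_g \in G(q)$ such that $\varphi_s \preceq \varphi \preceq \varphi_g$. Moreover, we have $(\varphi_s, q) \in \mathcal{S}(Q)$ and
$(\varphi_g, q) \in \mathcal{G}(Q)$.

Given the definitions of $\hat{\Sigma}(Q)$ and $\hat{\Gamma}(Q)$, there exist $(\varphi_i, q_i) \in \hat{\Sigma}(Q)$ and
$(\varphi_j, q_j) \in \hat{\Gamma}(Q)$ such that $(\varphi_i, q_i) \leq_{\hat{A}} (\varphi_s, q)$ and $(\varphi_g, q) \leq_{\hat{M}} (\varphi_j, q_j)$. Moreover,
we have $(\varphi_s, q) \leq_{\hat{A}} (\varphi, q)$ since $\varphi_s \preceq \varphi$, and $(\varphi, q) \leq_{\hat{M}} (\varphi_g, q)$ since $\varphi \preceq \varphi_g$.
Thus, $(\varphi_i, q_i) \leq_{\hat{A}} (\varphi, q)$ and $(\varphi, q) \leq_{\hat{M}} (\varphi_j, q_j)$. As $\varphi \in sol(q)$ and as $sol(q) \subseteq
sol(\bar{q})$, it follows that $\varphi \in X(q)$, which entails that $sol(q) \subseteq X(q)$. Thus, the
proof is complete. □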

Based on Lemma 5 above, we can state the following theorem. 







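Theorem 3. Let $Q = \{q_1, \ldots, q_n\}$ be a set of mining queries $q_i = \hat{q}_i \wedge \bar{q}_i$ where
$\hat{q}_i = \hat{m}_i \wedge \hat{a}_i$ with $\hat{m}_i \in \mathcal{M}$, $\hat{a}_i \in \mathcal{A}$, and $\bar{q}_i = \bar{m}_i \wedge \bar{a}_i$ with $\bar{m}_i \in \mathcal{M}$, $\bar{a}_i \in \mathcal{A}$
($i = 1, \ldots, n$). The set $\{\hat{\Sigma}(Q), \hat{\Gamma}(Q)\}$ is a condensed representation of $sol(Q)$,
i.e., $\hat{\Sigma}(Q), \hat{\Gamma}(Q) \models sol(Q)$.

Proof: Let us consider the function $F$ defined by:

$$F(X_1, X_2) = \{(\varphi, q) \in \mathcal{L} \times Q \mid \varphi \in sol(\bar{q}) \text{ and } (\exists (\varphi_1, q_1) \in X_1)((\varphi_1, q_1) \leq_{\hat{A}} (\varphi, q)) \text{ and } (\exists (\varphi_2, q_2) \in X_2)((\varphi, q) \leq_{\hat{M}} (\varphi_2, q_2))\}.$$

Using Lemma 5, we can easily see that $sol(Q) = F(\hat{\Sigma}(Q), \hat{\Gamma}(Q))$. Moreover,
$F$ is independent from the data set $\Delta$ since $\preceq$ and $\sqsubseteq$ do not depend on $\Delta$. Finally,
for every pair $(\varphi, q)$ in $\hat{\Sigma}(Q)$ or $\hat{\Gamma}(Q)$, we know that $(\varphi, q) \in sol(Q)$. Thus, we
have $\pi_{\mathcal{L}}(\hat{\Sigma}(Q) \cup \hat{\Gamma}(Q)) \subseteq \pi_{\mathcal{L}}(sol(Q))$, which completes the proof. □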

Unfortunately, as shown in the following example, the condensed representations
$\{\Sigma(Q), \Gamma(Q)\}$ and $\{\hat{\Sigma}(Q), \hat{\Gamma}(Q)\}$ are not comparable in general.
Intuitively, this is due to the fact that $(\varphi_1, q_1) \leq_A (\varphi_2, q_2)$ can hold whereas
$(\varphi_1, q_1) \leq_{\hat{A}} (\varphi_2, q_2)$ does not, or conversely.

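Example 7. In the context of our Running Example 1, let $q_5 = m_5 \wedge a_5$ and
$q_6 = m_6 \wedge a_6$ where $m_5$, $m_6$, $a_5$ and $a_6$ are defined for every $\varphi \in \mathcal{L}$ by:

- $m_5(\varphi, \Delta) = true$ if $sup(\varphi, \Delta) \leq 0.8$ and $A \subseteq \varphi$,

- $m_6(\varphi, \Delta) = true$ if $sup(\varphi, \Delta) \leq 0.9$ and $AC \subseteq \varphi$,

- $a_5(\varphi, \Delta) = true$ if $sup(\varphi, \Delta) \geq sup(AB, \Delta)$ and $\varphi \subseteq AC$,

- $a_6(\varphi, \Delta) = true$ if $sup(\varphi, \Delta) \geq sup(AC, \Delta)$ and $\varphi \subseteq ABC$.

We note that $m_5$ and $m_6$ are monotonic, whereas $a_5$ and $a_6$ are anti-monotonic.
Moreover, we can see that $S(q_5) = \{AC\}$, $S(q_6) = \{AC\}$, $G(q_5) = \{A\}$ and
$G(q_6) = \{AC\}$.

Now, considering $Q = \{q_5, q_6\}$, we have: $\mathcal{S}(Q) = \{(AC, q_5), (AC, q_6)\}$ and
$\mathcal{G}(Q) = \{(A, q_5), (AC, q_6)\}$. Moreover, we have $(AC, q_5) \leq_A (AC, q_6)$, whereas
$(A, q_5)$ and $(AC, q_6)$ are not comparable with respect to $\leq_M$. Therefore, we have:

$$\Sigma(Q) = \{(AC, q_5)\} \quad \text{and} \quad \Gamma(Q) = \{(A, q_5), (AC, q_6)\}.$$

On the other hand, $(AC, q_6) \leq_{\hat{M}} (A, q_5)$, whereas $(AC, q_5)$ and $(AC, q_6)$ are
not comparable with respect to $\leq_{\hat{A}}$. Therefore, we have:

$$\hat{\Sigma}(Q) = \{(AC, q_5), (AC, q_6)\} \quad \text{and} \quad \hat{\Gamma}(Q) = \{(A, q_5)\}.$$

Hence, we have $\Sigma(Q) \subset \hat{\Sigma}(Q)$ and $\hat{\Gamma}(Q) \subset \Gamma(Q)$, which shows that $\{\Sigma(Q),
\Gamma(Q)\}$ and $\{\hat{\Sigma}(Q), \hat{\Gamma}(Q)\}$ are not comparable. □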

The following lemma states a sufficient condition under which $\{\hat{\Sigma}(Q), \hat{\Gamma}(Q)\}$ is more
concise than $\{\Sigma(Q), \Gamma(Q)\}$. Intuitively, according to this condition, the
anti-monotonic (respectively monotonic) queries to be considered must satisfy the fact
that if two queries are comparable, then their dependent parts are comparable as
well.







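Lemma 6. Let $Q = \{q_1, \ldots, q_n\}$ be a set of mining queries $q_i = \hat{q}_i \wedge \bar{q}_i$ where
$\hat{q}_i = \hat{m}_i \wedge \hat{a}_i$ with $\hat{m}_i \in \mathcal{M}$, $\hat{a}_i \in \mathcal{A}$, and $\bar{q}_i = \bar{m}_i \wedge \bar{a}_i$ with $\bar{m}_i \in \mathcal{M}$, $\bar{a}_i \in \mathcal{A}$
($i = 1, \ldots, n$).

If for every pair $(a_i, a_j)$ such that $a_i \sqsubseteq a_j$, we have $\hat{a}_i \sqsubseteq \hat{a}_j$, and for every
pair $(m_i, m_j)$ such that $m_i \sqsubseteq m_j$, we have $\hat{m}_i \sqsubseteq \hat{m}_j$, then $\{\hat{\Sigma}(Q), \hat{\Gamma}(Q)\}$ is
more concise than $\{\Sigma(Q), \Gamma(Q)\}$.

Proof: Assume that for every pair $(a_i, a_j)$ such that $a_i \sqsubseteq a_j$, we have $\hat{a}_i \sqsubseteq \hat{a}_j$.
Then, for all pairs $(\varphi_1, q_1)$ and $(\varphi_2, q_2)$ in $\mathcal{S}(Q)$ such that $(\varphi_1, q_1) \leq_A (\varphi_2, q_2)$,
we also have $(\varphi_1, q_1) \leq_{\hat{A}} (\varphi_2, q_2)$. Hence, we have $\hat{\Sigma}(Q) \subseteq \Sigma(Q)$. In the same
way, we can see that if for every pair $(m_i, m_j)$ such that $m_i \sqsubseteq m_j$, we have
$\hat{m}_i \sqsubseteq \hat{m}_j$, then $\hat{\Gamma}(Q) \subseteq \Gamma(Q)$. Thus, the proof is complete. □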

In what follows, we identify a case where the previous lemma applies. This 
case makes use of the notion of dense measure function, defined by: 

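Definition 11. Let $f$ be a measure function defined over $\Lambda \subseteq \mathbb{R}$. We say that $f$
is dense in $\Lambda$ with respect to $\mathcal{L}$, if for every pair $(\lambda_1, \lambda_2) \in \Lambda^2$ such that $\lambda_1 < \lambda_2$
and every pattern $\varphi \in \mathcal{L}$, there exists a data set $\Delta$ such that $\lambda_1 < f(\varphi, \Delta) < \lambda_2$.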

Then, we have the following. 

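Proposition 7. Let $f$ be an increasing measure function defined from $\mathcal{L} \times \Delta$ over
$\Lambda \subseteq \mathbb{R}$ such that $f$ is dense in $\Lambda$ with respect to $\mathcal{L}$.

Let $\mathcal{Q}_f = \mathcal{A}_f \cup \mathcal{M}_f$ where $\mathcal{A}_f = \{\hat{a}_\lambda \mid \lambda \in \Lambda\}$ and $\mathcal{M}_f = \{\hat{m}_\lambda \mid \lambda \in \Lambda\}$ are
two sets of selection predicates defined by: for every data set $\Delta$ and every pattern
$\varphi \in \mathcal{L}$, $\hat{a}_\lambda(\varphi, \Delta) = true$ if $f(\varphi, \Delta) \geq \lambda$, and $\hat{m}_\lambda(\varphi, \Delta) = true$ if $f(\varphi, \Delta) \leq \lambda$.

Let $\bar{\mathcal{A}}$ and $\bar{\mathcal{M}}$ be two sets of independent selection predicates such that for every
$\bar{a}$ in $\bar{\mathcal{A}}$ (respectively $\bar{m} \in \bar{\mathcal{M}}$), $\bar{a}$ is anti-monotonic (respectively $\bar{m}$ is monotonic)
and $sol(\bar{a}) \neq \emptyset$ (respectively $sol(\bar{m}) \neq \emptyset$).

Let $Q = \{q_1, \ldots, q_n\}$ be a set of mining queries $q_i = \hat{q}_i \wedge \bar{q}_i$ where $\hat{q}_i = \hat{m}_i \wedge \hat{a}_i$
with $\hat{m}_i \in \mathcal{M}_f$, $\hat{a}_i \in \mathcal{A}_f$, and $\bar{q}_i = \bar{m}_i \wedge \bar{a}_i$ with $\bar{m}_i \in \bar{\mathcal{M}}$, $\bar{a}_i \in \bar{\mathcal{A}}$ ($i = 1, \ldots, n$).
Then, $\{\hat{\Sigma}(Q), \hat{\Gamma}(Q)\}$ is more concise than $\{\Sigma(Q), \Gamma(Q)\}$.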

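Proof: Using the notation of the proposition, based on Lemma 6, we have to
show that for every $i, j \in \{1, \ldots, n\}$, if $a_i \sqsubseteq a_j$, then $\hat{a}_i \sqsubseteq \hat{a}_j$ and that if
$m_i \sqsubseteq m_j$, then $\hat{m}_i \sqsubseteq \hat{m}_j$.

Assuming that $a_i \sqsubseteq a_j$ and $\hat{a}_i \not\sqsubseteq \hat{a}_j$ implies that there exist two reals $\lambda_i$ and
$\lambda_j$ such that for every pattern $\varphi \in \mathcal{L}$ and every data set $\Delta$, $\hat{a}_i(\varphi, \Delta) = true$ if
$f(\varphi, \Delta) \geq \lambda_i$ and $\hat{a}_j(\varphi, \Delta) = true$ if $f(\varphi, \Delta) \geq \lambda_j$. If $\hat{a}_i \not\sqsubseteq \hat{a}_j$, we necessarily
have $\lambda_i < \lambda_j$.

Moreover, given a pattern $\varphi \in sol(\bar{a}_i)$, there exists a data set $\Delta$ such that
$\lambda_i < f(\varphi, \Delta) < \lambda_j$. Then, we have $\varphi \in sol(a_i/\Delta)$ and $\varphi \notin sol(a_j/\Delta)$, which
contradicts the hypothesis $a_i \sqsubseteq a_j$.

Using similar arguments as above, it can be shown that if $m_i \sqsubseteq m_j$, then $\hat{m}_i \sqsubseteq
\hat{m}_j$, which completes the proof. □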

Now, we note that the previous proposition applies in the traditional case of
association rules where $\mathcal{L} = 2^{Items} \setminus \{\emptyset, Items\}$ and the measure function is the
function $sup$. Indeed, it is easy to see that the function $sup$ is dense in $[0, 1]$ with
respect to $\mathcal{L} = 2^{Items} \setminus \{\emptyset, Items\}$.

The following example shows how Proposition 7 applies in the context of our 
Running Example 1. 

Example 8. Let $Q = \{q_1, q_2\}$ be the set of simple mining queries $q_i = m_i \wedge a_i$
where $m_i$ and $a_i$ ($i = 1, 2$) are defined in our Running Example 1.

We recall from Example 6 that $\mathcal{S}(Q) = \{(ABCE, q_1), (ABC, q_2), (ACD, q_2)\}$
and $\mathcal{G}(Q) = \{(B, q_1), (A, q_2)\}$. Moreover, we also recall that $\Sigma(Q) = \mathcal{S}(Q)$ and
$\Gamma(Q) = \mathcal{G}(Q)$.

Since $ABC \subseteq ABCE$ ($ABCE \preceq ABC$) and $\hat{a}_1 \sqsubseteq \hat{a}_2$, we have $(ABCE, q_1) \leq_{\hat{A}}
(ABC, q_2)$. Thus, the pair $(ABC, q_2)$ does not belong to $\hat{\Sigma}(Q)$. Hence, we have

$$\hat{\Sigma}(Q) = \{(ABCE, q_1), (ACD, q_2)\}.$$

Then, since the pairs $(B, q_1)$ and $(A, q_2)$ are not comparable with respect to
$\leq_{\hat{M}}$, we have $\hat{\Gamma}(Q) = \mathcal{G}(Q)$. In conclusion, using Theorem 3, we can see that
$\hat{\Sigma}(Q), \hat{\Gamma}(Q) \models sol(Q)$. Moreover, since $\hat{\Sigma}(Q) \subset \Sigma(Q)$ and $\hat{\Gamma}(Q) \subseteq \Gamma(Q)$, it is
easy to see that $\{\hat{\Sigma}(Q), \hat{\Gamma}(Q)\}$ is more concise than $\{\Sigma(Q), \Gamma(Q)\}$. □

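Finally, regarding extended condensed representations, we can easily prove
the following theorem, based on Propositions 5 and 6 and on Theorem 3.

Theorem 4. Let $f$ be a monotonic increasing measure function in I and $Q =
\{q_1, \ldots, q_n\}$ be a set of mining queries $q_i = \hat{q}_i \wedge \bar{q}_i$ where $\hat{q}_i = \hat{m}_i \wedge \hat{a}_i$ with
$\hat{m}_i \in \mathcal{M}$, $\hat{a}_i \in \mathcal{A}$, and $\bar{q}_i = \bar{m}_i \wedge \bar{a}_i$ with $\bar{m}_i \in \mathcal{M}$, $\bar{a}_i \in \mathcal{A}$ ($i = 1, \ldots, n$). The sets
$\{\hat{\Sigma}(Q), \hat{\Gamma}(Q), \mathcal{SC}^*(Q, f)\}$ and $\{\hat{\Sigma}(Q), \hat{\Gamma}(Q), \mathcal{GK}^*(Q, f)\}$ are extended
condensed representations of $ans(Q, f)$, i.e., $\hat{\Sigma}(Q), \hat{\Gamma}(Q), \mathcal{SC}^*(Q, f) \models_e ans(Q, f)$
and $\hat{\Sigma}(Q), \hat{\Gamma}(Q), \mathcal{GK}^*(Q, f) \models_e ans(Q, f)$.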

5 Conclusion 

In this paper, we have considered the problem of defining condensed represen- 
tations of sets of mining queries. To this end, we have first studied the case of 
a single mining query and we have extended previous works on version spaces 
by [9] so as to take into account the presence of measure functions in the query. 
This has been done based on the well known notions of closed and key pat- 
terns ([3,16]). Then, we have seen how to extend this approach to sets of mining 
queries. The main idea in this extension is that, in order to obtain condensed 
representations in this case, when storing a pattern, one must keep track of the 
query the pattern comes from. 

Based on this work, we are currently investigating how condensed representa- 
tions can be used to optimize the iterative computation of the answer of mining 
queries. This problem has been studied for standard association rules [2,10,13,14] 
and multi-dimensional association rules [8,15]. In our framework, this problem 
can be stated as follows: given a data set $\Delta$, a set $Q = \{q_1, \ldots, q_n\}$ of mining
queries and a new extended mining query $(q, f)$:

1. How to optimize the computation of $ans(q, f)$ using the extended condensed
representations of $ans(Q, f)$?
2. How to efficiently modify the extended condensed representation of $ans(Q, f)$
so as to obtain an extended condensed representation of $ans(Q \cup \{q\}, f)$?

Moreover, it is clear that some tests are necessary to compare the various con- 
densed representations proposed in this paper. To this end, we are implementing 
our approach in the context of our previous work [8], where mining queries are 
composed through relational operators. We also investigate how our approach 
can be used to optimize the iterative computation of iceberg cubes [11]. 
Abstract. Instance retraction is a difficult problem for concept learning
by version spaces. This chapter introduces a family of version-space
representations called one-sided instance-based boundary sets. They are
correct and efficiently computable representations for admissible concept
languages. Compared to other representations, they are the most efficient
useful¹ version-space representations for instance retraction.



1 Introduction 

Currently, there is a renewed interest in version spaces caused by their applicabil- 
ity in inductive databases [5,6]. This chapter considers version spaces when the 
inductive query constraints are instances of a concept to be learned; i.e., the task 
is essentially a concept-learning task. In this context we study two important 
problems of inductive databases: the problem of efficiently representing version 
spaces and the problem of efficiency of version spaces for instance retraction. 

Mitchell defined version spaces as sets of concept descriptions that are con- 
sistent with training data [7,8]. Version-space learning is an incremental process: 

- If an instance i is added, the version space is revised so that it consists of all
the concept descriptions consistent with the processed training data plus i.

- If an instance i is retracted, the version space is revised so that it consists
of all the concept descriptions consistent with the processed training data
minus i.

To support these learning processes, version spaces need to be represented. The
standard representation is the boundary-set representation [7,8]. It is correct for the
class of admissible concept languages [7], but its size can grow exponentially in the
size of training data [1]. To overcome this problem alternative version-space
representations were introduced [2,3,4,9,10,11,12,13]. They extended the classes of
concept languages for which version spaces are efficiently computable.

However, a remaining problem for most version-space representations is that
they are inefficient for instance retraction, since they lack a structure that
determines the influence of an individual training instance. Hence, if a training
instance is retracted, the representations are recomputed [11]. To avoid this
problem two version-space representations were proposed. The first one is the
training-instance representation [3]. By its definition it is efficient for instance
retraction. However, the representation has only a theoretical value, since the
classification of each instance requires search in the concept language using all
the training data. The second representation is instance-based boundary sets
(IBBS) [11,12]. It is correct and efficiently computable for the class of admissible
concept languages. The instance-retraction algorithm of the IBBS is efficient
and it does not recompute the representation. At the moment the IBBS is the
most efficient useful version-space representation for instance retraction.

¹ In this chapter, the notion useful has a technical meaning defined in subsection 2.2.

In this chapter we address the question whether it is possible to design new 
version-space representations that are even more efficient than the IBBS in terms 
of computability and instance retraction. To answer the question we introduce a 
family of version-space representations called one-sided instance-based boundary 
sets. The family consists of two dual representations: instance-based maximal 
boundary sets (IBMBS) and instance-based minimal boundary sets (IBmBS). 
Without loss of generality we consider in detail only the IBMBS representation. 

The course of the chapter is as follows. In section 2 we formalise the necessary 
basic notions. They are used in section 3 to define the IBMBS representation. 
There, we prove that the representation is correct for the class of admissible 
concept languages and derive the conditions for finiteness. Section 4 presents 
four IBMBS algorithms for instance addition, instance retraction, version-space 
collapse and instance classification. It is shown that the IBMBS can be used for 
instance classification in the presence of noisy training data. In sections 5 and 6 
we provide an analysis and an evaluation of usefulness of the IBMBS. The dual 
representation of the IBMBS, instance-based minimal boundary sets (IBmBS), 
is touched upon in section 7. We compare the new representations with relevant 
work in section 8. Finally, in section 9 conclusions are given. 



2 Formalisation 

This section formalises the necessary basic notions. In subsection 2.1 we for- 
mulate the concept-learning task. Then, in subsection 2.2, we introduce version 
spaces as a solution of the task and we consider the notion of version-space rep- 
resentations together with their characteristics of usefulness. In this context we 
present the class of admissible concept languages in subsection 2.3. 

2.1 The Concept-Learning Task 

Concept learning assumes the presence of a universe of all the instances [11]. 
Notation 1. The universe of all the instances is denoted by I. 

Definition 2 (Concept). A concept c is a subset of I: $c \subseteq I$.

Given a concept, there exist two types of instance sets.

Definition 3 (Set of Positive/Negative Instances). A set $I^+ \subseteq I$ is a set
of positive instances of a concept $c \subseteq I$ if and only if $I^+ \subseteq c$. A set $I^- \subseteq I$ is a
set of negative instances of a concept $c \subseteq I$ if and only if $I^- \cap c = \emptyset$.
The set of all the concepts defined on the universe $I$ is the power set $P(I)$.
To represent concepts from $P(I)$ we introduce a language.

Definition 4 (Concept Language). The concept language $L_c$ is a set of
descriptions c.

To associate a description $c \in L_c$ with the concept in $P(I)$ that it represents,
we define a function $\mathcal{R}_c$.

Definition 5. The function $\mathcal{R}_c : L_c \rightarrow P(I)$ is an injective function that maps
a concept description $c \in L_c$ to a concept in $P(I)$.

Since $\mathcal{R}_c$ is a function, no two distinct concepts in $P(I)$ can be represented
by the same description in $L_c$. Since $\mathcal{R}_c$ is injective, no two distinct descriptions
in $L_c$ can represent the same concept in $P(I)$.

Instances are related to concepts by the membership relation. The relation is 
projected into a cover relation between instances and concept descriptions [7]. 

Definition 6 (Cover Relation M).

$M : L_c \times I \rightarrow Boolean$
defined by: $M(c, i) \leftrightarrow i \in \mathcal{R}_c(c)$.

The cover relation $M$ holds for a description $c \in L_c$ and an instance $i \in I$ if
and only if $i$ is a member of the concept $\mathcal{R}_c(c)$. If the relation $M$ holds for $c \in L_c$
and $i \in I$, we say that c covers i; otherwise, we say that c does not cover i.

After the introduction of the elements of the concept-learning task we formulate
the task itself according to [7,8,11]. Given a universe $I$ of all the instances,
a concept language $L_c$, a cover relation $M$, and the training sets $I^+$ and $I^-$ of a
target concept, the task is to find descriptions of the target concept in $L_c$.

2.2 Version Spaces 

A version space is a solution of the concept-learning task. It is a set of all the
concept descriptions that are consistent with the training sets $I^+$ and $I^-$ [7,8].
A description $c \in L_c$ is consistent with the sets $I^+$ and $I^-$ if and only if c covers
each instance $p \in I^+$ and does not cover any instance $n \in I^-$. Below we give a
formal definition of version spaces.

Definition 7 (Version Space). Given the training sets $I^+$ and $I^-$ of a target
concept, the version space $VS(I^+, I^-)$ is defined as follows:

$$VS(I^+, I^-) = \{c \in L_c \mid (\forall p \in I^+) M(c, p) \wedge (\forall n \in I^-) \neg M(c, n)\}.$$
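Definition 7 can be made concrete with a brute-force Python sketch. Everything about the toy setting is an assumption chosen for illustration and is not taken from the chapter: instances are pairs of binary attribute values, descriptions are pairs over {0, 1, "?"} where "?" matches any value, and the names `covers` and `version_space` are hypothetical.

```python
from itertools import product

WILDCARD = "?"

# Hypothetical universe I: all pairs of binary attribute values.
universe = list(product([0, 1], repeat=2))
# Hypothetical concept language L_c: conjunctions with optional wildcards.
language = list(product([0, 1, WILDCARD], repeat=2))

def covers(c, i):
    """Cover relation M(c, i): every component of c is a wildcard or equals i's."""
    return all(cv == WILDCARD or cv == iv for cv, iv in zip(c, i))

def version_space(positives, negatives):
    """All descriptions covering every positive and no negative instance."""
    return {
        c for c in language
        if all(covers(c, p) for p in positives)
        and not any(covers(c, n) for n in negatives)
    }

vs = version_space({(1, 1)}, {(0, 0)})
```

Enumerating the language is only feasible for tiny languages such as this one; the representations discussed in the chapter exist precisely to avoid it.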

To learn version spaces we need a representation. A version-space represen- 
tation is a structure that contains “information needed to reconstruct every de- 
scription in the version space” [7]. It has four possible characteristics: (1) com- 
pactness; (2) finiteness; (3) efficient computability [2,11]; and (4) efficiency of 
the algorithms for instance addition, instance retraction, version-space collapse 
and instance classification. To encapsulate these characteristics we introduce the 
notion of a useful version-space representation. 
Definition 8 (Useful Version-Space Representation). A version-space representation
is useful if and only if it is compact, finite, efficiently computable and
has efficient algorithms for instance addition, instance retraction, version-space 
collapse, and instance classification. 

2.3 Admissible Concept Languages 

The key to finding a compact version-space representation is to observe that concept
languages can be ordered. This can be done by a partial-ordering relation
“more general”. The relation is taken from [7,8] and is defined below.

Definition 9 (Relation “More General” ($\geq$)).

$$(\forall c_1, c_2 \in L_c)\big((c_1 \geq c_2) \leftrightarrow (\forall i \in I)(M(c_2, i) \rightarrow M(c_1, i))\big).$$

A description $c_1 \in L_c$ is more general than a description $c_2 \in L_c$ ($c_1 \geq c_2$)
if and only if for each instance $i \in I$, if $c_2$ covers i, then $c_1$ covers i as well. If a
description $c_1 \in L_c$ is more general than a description $c_2 \in L_c$ we say that $c_1$ is
a generalisation of $c_2$ and $c_2$ is a specialisation of $c_1$.

If the relation “$\geq$” is defined on a concept language $L_c$, then $L_c$ is partially
ordered. For defining version-space representations one class of partially-ordered
languages was extensively used, viz. the class of admissible concept languages
[7,11]. It is introduced after the definition of minimal and maximal sets of a
partially-ordered set (cf. [7]).

Definition 10 (Minimal/Maximal Set). If C is a partially-ordered set, then:

$$MIN(C) = \{c \in C \mid (\forall c' \in C)((c \geq c') \rightarrow (c' = c))\}$$

$$MAX(C) = \{c \in C \mid (\forall c' \in C)((c' \geq c) \rightarrow (c' = c))\}.$$

A partially-ordered concept language $L_c$ is admissible if each subset of $L_c$ is
bounded. A partially-ordered set is bounded if all the elements of the set are
between its minimal and maximal elements. Below we give a formal definition.

Definition 11 (Admissible Concept Language). A partially-ordered concept
language $L_c$ is admissible if and only if for every nonempty subset $C \subseteq L_c$:

$$C \subseteq \{c \in L_c \mid (\exists s \in MIN(C))(c \geq s) \wedge (\exists g \in MAX(C))(g \geq c)\}.$$

Given a version space $VS(I^+, I^-)$ in an admissible concept language, the
maximal set of $VS(I^+, I^-)$ is known as the maximal boundary set of $VS(I^+, I^-)$
and the minimal set of $VS(I^+, I^-)$ as the minimal boundary set of $VS(I^+, I^-)$.

Notation 12. The maximal boundary set and the minimal boundary set of a
version space $VS(I^+, I^-)$ are denoted by $G(I^+, I^-)$ and $S(I^+, I^-)$, respectively.

3 Instance-Based Maximal Boundary Sets 

Below we introduce instance-based maximal boundary sets (IBMBS) as a new 
version-space representation. The correctness of the representation is proven for 
the class of admissible concept languages. The IBMBS are shown to be compact 
and their conditions for finiteness are derived. 

3.1 Definition and Correctness 

The IBMBS representation consists of the set of positive training instances and a 
family of maximal boundary sets indexed by negative training instances. Below 
the representation is formally defined. 

Definition 13 (Instance-Based Maximal Boundary Sets). Consider an 
admissible concept language $L_c$ and training sets $I^+ \subseteq I$ and $I^- \subseteq I$ so that 
$I^- \neq \emptyset$. Then the instance-based-maximal-boundary-set representation of a 
version space $VS(I^+, I^-)$ is the ordered pair $\langle I^+, \{G(I^+, \{n\})\}_{n \in I^-} \rangle$. 

The IBMBS are “instance-based” since each of their elements corresponds to 
particular training instances. The IBMBS are “maximal boundary sets” since 
each of their elements in $\{G(I^+, \{n\})\}_{n \in I^-}$ is a maximal boundary set. The 
IBMBS are a one-sided version-space representation since their first part is the set 
$I^+$; i.e., this part does not contain any boundary-set element. 

To prove that the IBMBS are a correct version-space representation we 
give theorems 14 and 15 from [11]. Theorem 14 states that if a description 
$c \in L_c$ is more specific than at least one element of each maximal boundary 
set $G(I^+, \{n\})$ for all $n \in I^-$, then $c$ is consistent with the set $I^-$ of negative 
instances. 

Theorem 14. If the concept language $L_c$ is admissible, then: 

$$(\forall c \in L_c)\big((\forall n \in I^-)(\exists g \in G(I^+, \{n\}))(g \geq c) \rightarrow (\forall n \in I^-)\neg M(c, n)\big).$$ 

Theorem 15 states that a version space $VS(I_1^+, I_1^-)$ is a subset of a version 
space $VS(I_2^+, I_2^-)$ if and only if every description in $VS(I_1^+, I_1^-)$ is consistent 
with the sets $I_2^+$ and $I_2^-$. 

Theorem 15. $VS(I_1^+, I_1^-) \subseteq VS(I_2^+, I_2^-) \leftrightarrow$ 

$$(\forall c \in VS(I_1^+, I_1^-))\big((\forall p \in I_2^+)M(c, p) \wedge (\forall n \in I_2^-)\neg M(c, n)\big).$$ 

Theorem 16 (Correctness of IBMBS). Consider a version space $VS(I^+, I^-)$ 
represented by IBMBS: $\langle I^+, \{G(I^+, \{n\})\}_{n \in I^-} \rangle$. If the concept language $L_c$ is 
admissible, then: 

$$(\forall c \in L_c)\big(c \in VS(I^+, I^-) \leftrightarrow ((\forall p \in I^+)M(c, p) \wedge (\forall n \in I^-)(\exists g \in G(I^+, \{n\}))(g \geq c))\big).$$ 

Proof. ($\rightarrow$) Consider an arbitrarily chosen description $c \in VS(I^+, I^-)$. By 
theorem 15, $(\forall n \in I^-)(VS(I^+, I^-) \subseteq VS(I^+, \{n\}))$. Thus, $(\forall n \in I^-)(c \in VS(I^+, \{n\}))$. 
Since $L_c$ is admissible, according to definition 11 for each $VS(I^+, \{n\})$ we 
have: 

$$(\forall n \in I^-)(\exists g \in G(I^+, \{n\}))(g \geq c). \quad (1)$$ 

Since $c \in VS(I^+, I^-)$, according to definition 7: 

$$(\forall p \in I^+)M(c, p). \quad (2)$$ 

From (1) and (2) the first part of the theorem is proven. 

($\leftarrow$) Let $c \in L_c$ be arbitrarily chosen so that: 

$$(\forall p \in I^+)M(c, p). \quad (3)$$ 

$$(\forall n \in I^-)(\exists g \in G(I^+, \{n\}))(g \geq c). \quad (4)$$ 

By theorem 14, formula (4) implies: 

$$(\forall n \in I^-)\neg M(c, n). \quad (5)$$ 

Thus, $c \in L_c$, (3), and (5) imply $c \in VS(I^+, I^-)$ according to definition 7. □ 



Given the IBMBS of a version space $VS(I^+, I^-)$ and an admissible concept 
language, theorem 16 states that the concept descriptions in $VS(I^+, I^-)$ are 
exactly those that (1) cover all the positive instances in $I^+$, and (2) are more 
specific than at least one element of each maximal boundary set $G(I^+, \{n\})$. 
This means that the size of the IBMBS is not tied to the number of descriptions in 
the version space. Thus, the IBMBS are a compact version-space representation. 
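The membership test of theorem 16 can be sketched in a few lines of Python. The sketch assumes the simple attribute-vector language of example 1, where a description is a tuple over 0, 1 and "?"; the names `covers`, `more_general` and `in_version_space`, as well as the tuple encoding, are illustrative assumptions and not taken from the paper.

```python
from typing import Dict, Iterable, Set, Tuple

Instance = Tuple[int, ...]
Description = Tuple[object, ...]  # entries are 0, 1, or "?" (assumed encoding)
Q = "?"


def covers(c: Description, i: Instance) -> bool:
    """Cover relation M(c, i): each attribute of c is "?" or equals that of i."""
    return all(a == Q or a == v for a, v in zip(c, i))


def more_general(g: Description, c: Description) -> bool:
    """g >= c in this language: wherever g fixes a value, c fixes the same value."""
    return all(a == Q or a == b for a, b in zip(g, c))


def in_version_space(c: Description,
                     positives: Iterable[Instance],
                     g_family: Dict[Instance, Set[Description]]) -> bool:
    """Theorem 16: c covers every positive instance and lies below at least
    one element of each maximal boundary set G(I+, {n})."""
    covers_positives = all(covers(c, p) for p in positives)
    below_each_set = all(any(more_general(g, c) for g in g_set)
                         for g_set in g_family.values())
    return covers_positives and below_each_set


# The IBMBS of example 1, written out by hand.
positives = [(1,) * 8]
g_family = {
    (0, 0, 1, 1, 1, 1, 1, 1): {(1, Q, Q, Q, Q, Q, Q, Q), (Q, 1, Q, Q, Q, Q, Q, Q)},
    (1, 1, 0, 0, 1, 1, 1, 1): {(Q, Q, 1, Q, Q, Q, Q, Q), (Q, Q, Q, 1, Q, Q, Q, Q)},
    (1, 1, 1, 1, 0, 0, 1, 1): {(Q, Q, Q, Q, 1, Q, Q, Q), (Q, Q, Q, Q, Q, 1, Q, Q)},
}

inside = in_version_space((1, Q, 1, Q, 1, Q, Q, Q), positives, g_family)
outside = in_version_space((Q,) * 8, positives, g_family)
print(inside, outside)  # True False
```

The all-"?" description is rejected because no boundary-set element is more general than it, which mirrors the fact that it covers every negative instance.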



Example 1. Let the instance universe $I$ and the concept language $L_c$ be 1-CNF 
languages with 8 attributes. The domain of the $k$-th attribute in $I$ is $\{0, 1\}$ and in 
$L_c$ is $\{0, 1, ?\}$, where the symbol “?” indicates that any value is acceptable. The 
procedure of the cover relation $M$ returns true for a concept description $c \in L_c$ 
and an instance $i \in I$ if and only if for each attribute the values of $c$ and $i$ are 
equal or the value of $c$ equals “?”. In this context we consider a concept-learning 
task with the set $I^+$ consisting of one positive instance: $i_1^+ = (1, 1, 1, 1, 1, 1, 1, 1)$ 
and the set $I^-$ consisting of three negative instances: $i_2^- = (0, 0, 1, 1, 1, 1, 1, 1)$, 
$i_3^- = (1, 1, 0, 0, 1, 1, 1, 1)$ and $i_4^- = (1, 1, 1, 1, 0, 0, 1, 1)$. For this task, the IBMBS: 
$\langle I^+, \{G(I^+, \{n\})\}_{n \in I^-} \rangle$ of the version space $VS(I^+, I^-)$ consist of four sets: 

$I^+ = \{(1, 1, 1, 1, 1, 1, 1, 1)\}$, 

$G(I^+, \{i_2^-\}) = \{(1, ?, ?, ?, ?, ?, ?, ?), (?, 1, ?, ?, ?, ?, ?, ?)\}$, 

$G(I^+, \{i_3^-\}) = \{(?, ?, 1, ?, ?, ?, ?, ?), (?, ?, ?, 1, ?, ?, ?, ?)\}$, 

$G(I^+, \{i_4^-\}) = \{(?, ?, ?, ?, 1, ?, ?, ?), (?, ?, ?, ?, ?, 1, ?, ?)\}$. 

□ 
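The sets of example 1 can be reproduced mechanically. The following is a minimal sketch, assuming that in this particular language $G(\emptyset, \{n\})$ consists of the descriptions that fix exactly one attribute to the value opposite to that of $n$; the helper names `g_empty` and `g_plus` are hypothetical.

```python
from typing import Iterable, Set, Tuple

Instance = Tuple[int, ...]
Description = Tuple[object, ...]  # entries are 0, 1, or "?" (assumed encoding)
Q = "?"


def covers(c: Description, i: Instance) -> bool:
    """Cover relation M(c, i) of example 1."""
    return all(a == Q or a == v for a, v in zip(c, i))


def g_empty(n: Instance) -> Set[Description]:
    """G(empty set, {n}): fix one attribute to the value that n does not have."""
    size = len(n)
    return {tuple(1 - n[k] if j == k else Q for j in range(size))
            for k in range(size)}


def g_plus(positives: Iterable[Instance], n: Instance) -> Set[Description]:
    """G(I+, {n}): the elements of G(empty set, {n}) that cover all positives."""
    positives = list(positives)
    return {g for g in g_empty(n) if all(covers(g, p) for p in positives)}


positives = [(1,) * 8]
negatives = [(0, 0, 1, 1, 1, 1, 1, 1),
             (1, 1, 0, 0, 1, 1, 1, 1),
             (1, 1, 1, 1, 0, 0, 1, 1)]

# The IBMBS is the pair (I+, family of G(I+, {n}) indexed by n in I-).
ibmbs = (set(positives), {n: g_plus(positives, n) for n in negatives})
for n, g_set in ibmbs[1].items():
    print(n, sorted(g_set, key=str))
```

Each negative instance contributes two boundary elements here, one for each attribute in which it differs from the positive instance.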



3.2 Finiteness 

Since the IBMBS are compact, it is important to determine when they are finite. 
We introduce constraints on the training sets and the concept language. We show 
that they are sufficient and necessary conditions for the finiteness of the IBMBS. 
We start with the constraints on the training sets. 

Constraint 17. The training sets $I^+$ and $I^-$ are finite. 

Constraint 17 implies that the number of the maximal boundary sets 
$G(I^+, \{n\})$ is finite. Hence, the IBMBS are finite in this case if each set $G(I^+, \{n\})$ is finite. 
To guarantee this property we introduce a constraint on the concept language. 

Constraint 18. The maximal boundary set $G(\emptyset, \{n\})$ is finite for all $n \in I$. 

To explain how constraint 18 affects each maximal boundary set $G(I^+, \{n\})$, 
theorem 19 is taken from [7]. The theorem states that the set $G(I^+ \cup \{i\}, I^-)$ is 
equal to the set of those elements of the set $G(I^+, I^-)$ that cover the instance $i$. 

Theorem 19. Consider training sets $I^+, I^- \subseteq I$. If the concept language $L_c$ is 
admissible, then for all $p \in I$: 

$$G(I^+ \cup \{p\}, I^-) = \{g \in G(I^+, I^-) \mid M(g, p)\}.$$ 

An important consequence of theorem 19 is given in corollary 20 below. 

Corollary 20. Consider sets $I_1^+, I_2^+ \subseteq I$ so that $I_1^+ \subseteq I_2^+$. Then for all $n \in I$: 

$$G(I_2^+, \{n\}) \subseteq G(I_1^+, \{n\}).$$ 

Using corollary 20 we formulate the following theorem. 

Theorem 21. The maximal boundary set $G(I^+, \{n\})$ is finite for all $n \in I$ and 
$I^+ \subseteq I$ if and only if constraint 18 holds. 

Combining constraints 17 and 18, and using theorem 21 we finish the section 
by formulating the theorem of the IBMBS being finite. 

Theorem 22. The IBMBS are finite if and only if constraints 17 and 18 hold. 



4 Algorithms of the IBMBS 

This section introduces four algorithms of the IBMBS. The instance-addition 
algorithm is given in subsection 4.1; the instance-retraction algorithm is given 
in subsection 4.2; the algorithm for version-space collapse is given in subsection 
4.3; and the instance-classification algorithm and its extension for noisy training 
data are given in subsection 4.4. 

4.1 Instance-Addition Algorithm 

The instance-addition algorithm of the IBMBS revises the representation given a 
new training instance. It is correct for the class of admissible concept languages. 
The algorithm consists of two parts for handling positive and negative training 
instances. They are based on theorem 23 and theorem 16, respectively. 

Theorem 23. Consider a version space $VS(I^+, I^-)$ represented by IBMBS: 
$\langle I^+, \{G(I^+, \{n\})\}_{n \in I^-} \rangle$, and a version space $VS(I^+ \cup \{i\}, I^-)$ represented by 
IBMBS: $\langle I^+ \cup \{i\}, \{G(I^+ \cup \{i\}, \{n\})\}_{n \in I^-} \rangle$. If the concept language $L_c$ is 
admissible, then: 

$$G(I^+ \cup \{i\}, \{n\}) = \{g \in G(I^+, \{n\}) \mid M(g, i)\} \text{ for all } n \in I^-.$$ 

Proof. The theorem follows from theorem 19. □ 

Instance-Addition Algorithm 

Input: $i$: a new training instance. 
       $\langle I^+, \{G(I^+, \{n\})\}_{n \in I^-} \rangle$: IBMBS of $VS(I^+, I^-)$. 

Output: $\langle I^+ \cup \{i\}, \{G(I^+ \cup \{i\}, \{n\})\}_{n \in I^-} \rangle$: IBMBS of $VS(I^+ \cup \{i\}, I^-)$ if $i$ is positive. 
        $\langle I^+, \{G(I^+, \{n\})\}_{n \in I^- \cup \{i\}} \rangle$: IBMBS of $VS(I^+, I^- \cup \{i\})$ if $i$ is negative. 

Precondition: $L_c$ is admissible. 

if instance $i$ is positive then 
    for $n \in I^-$ do 
        $G(I^+ \cup \{i\}, \{n\}) = \{g \in G(I^+, \{n\}) \mid M(g, i)\}$ 
    return $\langle I^+ \cup \{i\}, \{G(I^+ \cup \{i\}, \{n\})\}_{n \in I^-} \rangle$ 
if instance $i$ is negative then 
    Generate the set $G(\emptyset, \{i\})$ 
    $G(I^+, \{i\}) = \{g \in G(\emptyset, \{i\}) \mid (\forall p \in I^+)M(g, p)\}$ 
    return $\langle I^+, \{G(I^+, \{n\})\}_{n \in I^- \cup \{i\}} \rangle$. 

Fig. 1. The Instance-Addition Algorithm 



The instance-addition algorithm can be described as follows (see figure 1). 
If a new positive training instance $i$ is given, the algorithm forms the maximal 
boundary sets $G(I^+ \cup \{i\}, \{n\})$ for all $n \in I^-$. Each set $G(I^+ \cup \{i\}, \{n\})$ is formed 
from those elements of the corresponding set $G(I^+, \{n\})$ that cover the instance 
$i$. The resulting IBMBS of the version space $VS(I^+ \cup \{i\}, I^-)$ are formed from 
the set $I^+ \cup \{i\}$ and the maximal boundary sets $G(I^+ \cup \{i\}, \{n\})$ for all $n \in I^-$. 

If the instance $i$ is negative, the algorithm first forms the maximal boundary 
set $G(I^+, \{i\})$ in two steps. In the first step it generates the maximal boundary 
set $G(\emptyset, \{i\})$. In the second step the algorithm forms $G(I^+, \{i\})$ from the 
elements of $G(\emptyset, \{i\})$ that cover all the instances in $I^+$ (see theorem 19). Then, the 
resulting IBMBS of the version space $VS(I^+, I^- \cup \{i\})$ are formed from the set 
$I^+$ and the maximal boundary sets $G(I^+, \{n\})$ for all $n \in I^- \cup \{i\}$. 



Example 2. Let us illustrate the instance-addition algorithm given the IBMBS 
from example 1. Assume that we have a new negative training instance $i_5^- = 
(1, 1, 1, 1, 1, 1, 0, 0)$. The algorithm first generates the maximal boundary set 
$G(I^+, \{i_5^-\}) = \{(?, ?, ?, ?, ?, ?, 1, ?), (?, ?, ?, ?, ?, ?, ?, 1)\}$. Then, it adds $G(I^+, \{i_5^-\})$ 
to the IBMBS. The resulting IBMBS consist of five sets: 

$I^+ = \{(1, 1, 1, 1, 1, 1, 1, 1)\}$, 

$G(I^+, \{i_2^-\}) = \{(1, ?, ?, ?, ?, ?, ?, ?), (?, 1, ?, ?, ?, ?, ?, ?)\}$, 

$G(I^+, \{i_3^-\}) = \{(?, ?, 1, ?, ?, ?, ?, ?), (?, ?, ?, 1, ?, ?, ?, ?)\}$, 

$G(I^+, \{i_4^-\}) = \{(?, ?, ?, ?, 1, ?, ?, ?), (?, ?, ?, ?, ?, 1, ?, ?)\}$, 

$G(I^+, \{i_5^-\}) = \{(?, ?, ?, ?, ?, ?, 1, ?), (?, ?, ?, ?, ?, ?, ?, 1)\}$. 

Assume now that we have a new positive instance $i_6^+ = (1, 0, 1, 0, 1, 0, 1, 0)$. 
The algorithm forms for each $n \in I^-$ the maximal boundary set $G(I^+ \cup \{i_6^+\}, \{n\})$ 
from the elements of the set $G(I^+, \{n\})$ that cover the instance $i_6^+$. It adds the 
instance $i_6^+$ to the training set $I^+$. The resulting IBMBS consist of five sets: 

$I^+ = \{(1, 1, 1, 1, 1, 1, 1, 1), (1, 0, 1, 0, 1, 0, 1, 0)\}$, 

$G(I^+, \{i_2^-\}) = \{(1, ?, ?, ?, ?, ?, ?, ?)\}$, $G(I^+, \{i_3^-\}) = \{(?, ?, 1, ?, ?, ?, ?, ?)\}$, 

$G(I^+, \{i_4^-\}) = \{(?, ?, ?, ?, 1, ?, ?, ?)\}$, $G(I^+, \{i_5^-\}) = \{(?, ?, ?, ?, ?, ?, 1, ?)\}$. □ 
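The two branches of the instance-addition algorithm can be sketched as follows. This is an illustrative Python version for the language of example 1 only; `add_instance`, `g_empty` and the dictionary-based representation of the family of boundary sets are assumptions made for the sketch, not part of the paper.

```python
from typing import Dict, FrozenSet, Set, Tuple

Instance = Tuple[int, ...]
Description = Tuple[object, ...]  # entries are 0, 1, or "?" (assumed encoding)
Family = Dict[Instance, Set[Description]]
Q = "?"


def covers(c: Description, i: Instance) -> bool:
    """Cover relation M(c, i) of example 1."""
    return all(a == Q or a == v for a, v in zip(c, i))


def g_empty(n: Instance) -> Set[Description]:
    """G(empty set, {n}) for this language: one attribute fixed against n."""
    size = len(n)
    return {tuple(1 - n[k] if j == k else Q for j in range(size))
            for k in range(size)}


def add_instance(positives: FrozenSet[Instance], family: Family,
                 i: Instance, is_positive: bool):
    """Return the IBMBS after adding instance i (figure 1)."""
    if is_positive:
        # Theorem 23: keep only the boundary elements that cover i.
        filtered = {n: {g for g in g_set if covers(g, i)}
                    for n, g_set in family.items()}
        return positives | {i}, filtered
    # Negative instance: build G(I+, {i}) from G(empty set, {i}).
    extended = dict(family)
    extended[i] = {g for g in g_empty(i)
                   if all(covers(g, p) for p in positives)}
    return positives, extended


# Replay examples 1 and 2: four negative instances, then one positive one.
state = (frozenset({(1,) * 8}), {})
for negative in [(0, 0, 1, 1, 1, 1, 1, 1), (1, 1, 0, 0, 1, 1, 1, 1),
                 (1, 1, 1, 1, 0, 0, 1, 1), (1, 1, 1, 1, 1, 1, 0, 0)]:
    state = add_instance(*state, negative, False)
state = add_instance(*state, (1, 0, 1, 0, 1, 0, 1, 0), True)
for n, g_set in state[1].items():
    print(n, g_set)
```

Note that adding a positive instance never regenerates a boundary set; it only filters the stored ones, which is why that branch needs no access to `g_empty`.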



4.2 Instance-Retraction Algorithm 

The instance-retraction algorithm of the IBMBS revises the representation when 
an instance is removed from one of the training sets. It is correct for the class of 
admissible concept languages when the property G holds [11]. 

Definition 24 (Property G). An admissible concept language is said to have 
property G if for all $n_1, n_2 \in I$: 

$$\{g \in G(\emptyset, \{n_1\}) \mid \neg M(g, n_2)\} = \{g \in G(\emptyset, \{n_2\}) \mid \neg M(g, n_1)\}.$$ 

An admissible concept language has the property G if for all $n_1, n_2 \in I$ the 
subset of the elements of the set $G(\emptyset, \{n_1\})$ that do not cover the instance $n_2$ 
equals the subset of the elements of the set $G(\emptyset, \{n_2\})$ that do not cover the 
instance $n_1$. A simple consequence of the property G is given in a corollary below. 

Corollary 25. If the property G holds, then for all $n_1, n_2 \in I$, and all $I^+ \subseteq I$: 

$$\{g \in G(I^+, \{n_1\}) \mid \neg M(g, n_2)\} = \{g \in G(I^+, \{n_2\}) \mid \neg M(g, n_1)\}.$$ 

The instance-retraction algorithm consists of two parts for handling positive 
and negative instances. They are based on theorems 26 and 16, respectively. 

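Theorem 26. Consider a version space $VS(I^+, I^-)$ represented by IBMBS: 
$\langle I^+, \{G(I^+, \{n\})\}_{n \in I^-} \rangle$, and a second version space $VS(I^+ \setminus \{i\}, I^-)$ represented 
by IBMBS: $\langle I^+ \setminus \{i\}, \{G(I^+ \setminus \{i\}, \{n\})\}_{n \in I^-} \rangle$, where $i \in I^+$. If the concept 
language $L_c$ is admissible and the property G holds, then: 

$$G(I^+ \setminus \{i\}, \{n\}) = G(I^+, \{n\}) \cup \{g \in G(I^+ \setminus \{i\}, \{i\}) \mid \neg M(g, n)\} \text{ for all } n \in I^-.$$ 

Proof. For each $n \in I^-$: 

$$G(I^+ \setminus \{i\}, \{n\}) = \{g \in G(I^+ \setminus \{i\}, \{n\}) \mid M(g, i)\} \cup \{g \in G(I^+ \setminus \{i\}, \{n\}) \mid \neg M(g, i)\}.$$ 

According to theorem 19: 

$$\{g \in G(I^+ \setminus \{i\}, \{n\}) \mid M(g, i)\} = G(I^+, \{n\})$$ 

and according to corollary 25: 

$$\{g \in G(I^+ \setminus \{i\}, \{n\}) \mid \neg M(g, i)\} = \{g \in G(I^+ \setminus \{i\}, \{i\}) \mid \neg M(g, n)\}.$$ 

Thus, 

$$G(I^+ \setminus \{i\}, \{n\}) = G(I^+, \{n\}) \cup \{g \in G(I^+ \setminus \{i\}, \{i\}) \mid \neg M(g, n)\}.$$ □ 

Instance-Retraction Algorithm 

Input: $i$: a training instance in $I^+ \cup I^-$. 
       $\langle I^+, \{G(I^+, \{n\})\}_{n \in I^-} \rangle$: IBMBS of $VS(I^+, I^-)$. 

Output: $\langle I^+ \setminus \{i\}, \{G(I^+ \setminus \{i\}, \{n\})\}_{n \in I^-} \rangle$: IBMBS of $VS(I^+ \setminus \{i\}, I^-)$ if $i \in I^+$. 
        $\langle I^+, \{G(I^+, \{n\})\}_{n \in I^- \setminus \{i\}} \rangle$: IBMBS of $VS(I^+, I^- \setminus \{i\})$ if $i \in I^-$. 

Precondition: $L_c$ is admissible, property G holds, and $|I^-| > 1$ if $i \in I^-$. 

if $i \in I^+$ then 
    Generate the set $G(\emptyset, \{i\})$ 
    $G(I^+ \setminus \{i\}, \{i\}) = \{g \in G(\emptyset, \{i\}) \mid (\forall p \in I^+ \setminus \{i\})M(g, p)\}$ 
    for $n \in I^-$ do 
        $G(I^+ \setminus \{i\}, \{n\}) = G(I^+, \{n\}) \cup \{g \in G(I^+ \setminus \{i\}, \{i\}) \mid \neg M(g, n)\}$ 
    return $\langle I^+ \setminus \{i\}, \{G(I^+ \setminus \{i\}, \{n\})\}_{n \in I^-} \rangle$ 
if $i \in I^-$ then 
    return $\langle I^+, \{G(I^+, \{n\})\}_{n \in I^- \setminus \{i\}} \rangle$. 

Fig. 2. The Instance-Retraction Algorithm 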



The instance-retraction algorithm can be described as follows (see figure 2). 
If an instance $i$ is removed from the set $I^+$, the algorithm executes two steps 
following theorem 26. In the first step it forms the maximal boundary set 
$G(I^+ \setminus \{i\}, \{i\})$. This is done by first generating the maximal boundary set $G(\emptyset, \{i\})$ 
and then by removing those elements of $G(\emptyset, \{i\})$ that do not cover at least 
one instance in $I^+ \setminus \{i\}$ (see theorem 19). In the second step the algorithm 
forms the maximal boundary set $G(I^+ \setminus \{i\}, \{n\})$ for each $n \in I^-$. The set 
$G(I^+ \setminus \{i\}, \{n\})$ is formed as a union of the corresponding sets $G(I^+, \{n\})$ and 
$\{g \in G(I^+ \setminus \{i\}, \{i\}) \mid \neg M(g, n)\}$. The resulting IBMBS of the version space 
$VS(I^+ \setminus \{i\}, I^-)$ are formed from the set $I^+ \setminus \{i\}$ and the maximal boundary 
sets $G(I^+ \setminus \{i\}, \{n\})$ for all $n \in I^-$. 

If the instance $i$ is removed from the set $I^-$, the algorithm forms the resulting 
IBMBS of the version space $VS(I^+, I^- \setminus \{i\})$ from the set $I^+$ and the maximal 
boundary sets $G(I^+, \{n\})$ for all $n \in I^- \setminus \{i\}$. 



Example 3. Let us illustrate the instance-retraction algorithm given the last 
IBMBS from example 2. Note that the property G holds for the concept 
language used. Assume that we have to retract the positive training instance $i_6^+ = 
(1, 0, 1, 0, 1, 0, 1, 0)$. The algorithm forms the boundary set $G(I^+ \setminus \{i_6^+\}, \{i_6^+\}) = 
\{(?, 1, ?, ?, ?, ?, ?, ?), (?, ?, ?, 1, ?, ?, ?, ?), (?, ?, ?, ?, ?, 1, ?, ?), (?, ?, ?, ?, ?, ?, ?, 1)\}$. 

Then, it forms for each $n \in I^-$ the maximal boundary set $G(I^+ \setminus \{i_6^+\}, \{n\})$ as 
a union of the sets $G(I^+, \{n\})$ and $\{g \in G(I^+ \setminus \{i_6^+\}, \{i_6^+\}) \mid \neg M(g, n)\}$. The 
instance $i_6^+$ is excluded from the training set $I^+$ and the resulting IBMBS coincide 
with the first IBMBS from example 2. 

Assume now that we have to retract the negative training instance $i_5^- = 
(1, 1, 1, 1, 1, 1, 0, 0)$. The algorithm excludes: (1) the instance from the training 
set $I^-$, and (2) the maximal boundary set $G(I^+, \{i_5^-\})$ from the IBMBS. The 
resulting IBMBS coincide with those from example 1. □ 
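Retraction can be sketched in the same style. The code below is an illustration for the language of example 1 (for which the paper notes that property G holds); `retract_instance` and the helper `build`, which recomputes an IBMBS from scratch purely for comparison, are hypothetical names.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Instance = Tuple[int, ...]
Description = Tuple[object, ...]  # entries are 0, 1, or "?" (assumed encoding)
Family = Dict[Instance, Set[Description]]
Q = "?"


def covers(c: Description, i: Instance) -> bool:
    """Cover relation M(c, i) of example 1."""
    return all(a == Q or a == v for a, v in zip(c, i))


def g_empty(n: Instance) -> Set[Description]:
    """G(empty set, {n}) for this language: one attribute fixed against n."""
    size = len(n)
    return {tuple(1 - n[k] if j == k else Q for j in range(size))
            for k in range(size)}


def build(positives: Iterable[Instance], negatives: Iterable[Instance]):
    """Compute an IBMBS directly from the training sets (reference only)."""
    pos = frozenset(positives)
    return pos, {n: {g for g in g_empty(n) if all(covers(g, p) for p in pos)}
                 for n in negatives}


def retract_instance(positives: FrozenSet[Instance], family: Family,
                     i: Instance):
    """Return the IBMBS after retracting instance i (figure 2)."""
    if i in positives:
        rest = positives - {i}
        # First step: G(I+ \ {i}, {i}).
        g_i = {g for g in g_empty(i) if all(covers(g, p) for p in rest)}
        # Second step (theorem 26): restore elements that i had filtered out.
        restored = {n: g_set | {g for g in g_i if not covers(g, n)}
                    for n, g_set in family.items()}
        return rest, restored
    # Negative instance: drop its maximal boundary set.
    return positives, {n: s for n, s in family.items() if n != i}


i1, i6 = (1,) * 8, (1, 0, 1, 0, 1, 0, 1, 0)
negatives = [(0, 0, 1, 1, 1, 1, 1, 1), (1, 1, 0, 0, 1, 1, 1, 1),
             (1, 1, 1, 1, 0, 0, 1, 1), (1, 1, 1, 1, 1, 1, 0, 0)]

after_positive = retract_instance(*build([i1, i6], negatives), i6)
after_negative = retract_instance(*after_positive, negatives[3])
print(after_positive == build([i1], negatives))      # True
print(after_negative == build([i1], negatives[:3]))  # True
```

Both comparisons reproduce the claims of example 3: the first retraction yields the first IBMBS of example 2 and the second yields the IBMBS of example 1.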







4.3 Algorithm for Version-Space Collapse 

The algorithm for version-space collapse checks whether a version space repre- 
sented by IBMBS is empty. It is proposed for the class of admissible concept 
languages when the intersection-preserving property (IP) holds [11]. 

Definition 27 (Intersection-Preserving Property (IP)). An admissible 
concept language is said to have the intersection-preserving property if for each 
nonempty set $C \subseteq L_c$ there exists a description $c \in L_c$ so that: 

$$(\forall i \in I)\big((\forall c' \in C)M(c', i) \leftrightarrow M(c, i)\big).$$ 

An admissible concept language $L_c$ exhibits the property IP when for each 
nonempty subset $C \subseteq L_c$ there exists a description $c \in L_c$ so that an instance 
$i \in I$ is covered by all the descriptions $c' \in C$ if and only if $i$ is covered by $c$. The 
property is introduced because it guarantees that if the training set $I^-$ is not 
empty, the version space $VS(I^+, I^-)$ is not empty if and only if for each $n \in I^-$ 
the version space $VS(I^+, \{n\})$ is not empty (see theorem 28 taken from [11]). 

Theorem 28. Consider an admissible concept language $L_c$ such that the 
property IP holds. If the set $I^-$ is nonempty, then: 

$$(VS(I^+, I^-) \neq \emptyset) \leftrightarrow (\forall n \in I^-)(VS(I^+, \{n\}) \neq \emptyset).$$ 

To check a version space $VS(I^+, I^-)$ for collapse, by theorem 28 we can check 
for collapse of the version spaces $VS(I^+, \{n\})$ for $n \in I^-$. Since $VS(I^+, \{n\})$ are 
given by maximal boundary sets $G(I^+, \{n\})$ in the IBMBS of $VS(I^+, I^-)$, we 
give a relation between the sets $G(I^+, \{n\})$ and version spaces $VS(I^+, \{n\})$ [11]. 

Theorem 29. $(VS(I^+, I^-) \neq \emptyset) \leftrightarrow (G(I^+, I^-) \neq \emptyset)$. 

Theorems 28 and 29 imply corollary 30 below. It states that if the property 
IP holds and the training set $I^-$ is nonempty, the version space $VS(I^+, I^-)$ is 
nonempty if and only if for each $n \in I^-$ the set $G(I^+, \{n\})$ is nonempty. 

Corollary 30. Consider an admissible concept language $L_c$ such that the 
property IP holds. If the set $I^-$ is nonempty, then: 

$$(VS(I^+, I^-) \neq \emptyset) \leftrightarrow (\forall n \in I^-)(G(I^+, \{n\}) \neq \emptyset).$$ 

The version-space collapse algorithm is given in figure 3. If a version space 
$VS(I^+, I^-)$, given by IBMBS, is checked for collapse, the algorithm visits the 
maximal boundary sets $G(I^+, \{n\})$ for $n \in I^-$. If none of the sets $G(I^+, \{n\})$ is 
empty, by corollary 30 $VS(I^+, I^-)$ is not empty and the algorithm returns false. 
Otherwise, by corollary 30 $VS(I^+, I^-)$ is empty and the algorithm returns true. 

Example 4. Let us illustrate the algorithm for version-space collapse given 
the IBMBS from example 1. Note that property IP holds for the concept 
language used. The algorithm checks the maximal boundary sets $G(I^+, \{i_2^-\})$, 
$G(I^+, \{i_3^-\})$, and $G(I^+, \{i_4^-\})$. Since none of them is empty, the algorithm 
returns false; i.e., the version space is nonempty. □ 

VS-Collapse Algorithm 

Input: $\langle I^+, \{G(I^+, \{n\})\}_{n \in I^-} \rangle$: IBMBS of $VS(I^+, I^-)$. 
Output: true if $VS(I^+, I^-) = \emptyset$. 
        false if $VS(I^+, I^-) \neq \emptyset$. 

Precondition: $L_c$ is admissible and the property IP holds. 

for $n \in I^-$ do 
    if $G(I^+, \{n\}) = \emptyset$ then 
        return true 
return false. 

Fig. 3. The Algorithm for Version-Space Collapse 
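
The collapse check of figure 3 reduces to a scan for an empty boundary set. A minimal Python sketch, again for the language of example 1 and with hypothetical helper names, is:

```python
from typing import Dict, Iterable, Set, Tuple

Instance = Tuple[int, ...]
Description = Tuple[object, ...]  # entries are 0, 1, or "?" (assumed encoding)
Q = "?"


def covers(c: Description, i: Instance) -> bool:
    """Cover relation M(c, i) of example 1."""
    return all(a == Q or a == v for a, v in zip(c, i))


def g_plus(positives: Iterable[Instance], n: Instance) -> Set[Description]:
    """G(I+, {n}) for this language: single-attribute descriptions that
    exclude n and cover every positive instance."""
    positives = list(positives)
    size = len(n)
    candidates = {tuple(1 - n[k] if j == k else Q for j in range(size))
                  for k in range(size)}
    return {g for g in candidates if all(covers(g, p) for p in positives)}


def collapsed(family: Dict[Instance, Set[Description]]) -> bool:
    """Corollary 30: the version space is empty iff some G(I+, {n}) is empty."""
    return any(len(g_set) == 0 for g_set in family.values())


positives = [(1,) * 8]
negatives = [(0, 0, 1, 1, 1, 1, 1, 1), (1, 1, 0, 0, 1, 1, 1, 1),
             (1, 1, 1, 1, 0, 0, 1, 1)]
example_1 = {n: g_plus(positives, n) for n in negatives}

# A contradictory task: the same instance is both positive and negative.
contradictory = {(1,) * 8: g_plus(positives, (1,) * 8)}

print(collapsed(example_1), collapsed(contradictory))  # False True
```

The check inspects only whether each set is empty, never its elements, which is consistent with the $O(|I^-|)$ bound derived in section 5.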

4.4 Instance-Classification Algorithm 

Instance classification with version spaces is realised by the unanimous-voting 
rule: an instance is classified if and only if all the descriptions in a version space 
agree on a classification of the instance [7,8]. The rule can be implemented using 
theorems 31 and 32 taken from [11]. Theorem 31 states that all the descriptions 
of a version space $VS(I^+, I^-)$ do cover an instance $i \in I$ if and only if the version 
space $VS(I^+, I^- \cup \{i\})$ is empty. Theorem 32 states that all the descriptions of 
a version space $VS(I^+, I^-)$ do not cover an instance $i \in I$ if and only if the 
version space $VS(I^+ \cup \{i\}, I^-)$ is empty. 

Theorem 31. $(\forall i \in I)\big((\forall c \in VS(I^+, I^-))M(c, i) \leftrightarrow (VS(I^+, I^- \cup \{i\}) = \emptyset)\big)$. 

Theorem 32. $(\forall i \in I)\big((\forall c \in VS(I^+, I^-))\neg M(c, i) \leftrightarrow (VS(I^+ \cup \{i\}, I^-) = \emptyset)\big)$. 

The instance-classification algorithm of the IBMBS realises the unanimous- 
voting rule for the class of admissible concept languages if the property IP holds. 
The positive instance classification is based on theorem 31, and the negative 
instance classification is based on theorem 33. Theorem 33 is used instead of 
theorem 32 for efficiency reasons. It states that if the concept language is 
admissible and the property IP holds, then all the descriptions of a version space 
$VS(I^+, I^-)$ do not cover an instance $i \in I$ if and only if there exists a version 
space $VS(I^+, \{n\})$ of which all the descriptions do not cover the instance $i$. 

Theorem 33. Consider an admissible concept language $L_c$ such that the 
property IP holds. If the set $I^-$ is nonempty, then: 

$$(\forall i \in I)\big((\forall c \in VS(I^+, I^-))\neg M(c, i) \leftrightarrow (\exists n \in I^-)(\forall c \in VS(I^+, \{n\}))\neg M(c, i)\big).$$ 

Proof. Consider an arbitrary $i \in I$. Then: 

$(\forall c \in VS(I^+, I^-))\neg M(c, i)$ iff (theorem 32) 

$VS(I^+ \cup \{i\}, I^-) = \emptyset$ iff (theorem 28) 

$(\exists n \in I^-)\, VS(I^+ \cup \{i\}, \{n\}) = \emptyset$ iff (theorem 32) 

$(\exists n \in I^-)(\forall c \in VS(I^+, \{n\}))\neg M(c, i)$. □ 

By theorem 33 a negative instance classification can be obtained by the 
version spaces $VS(I^+, \{n\})$. Since $VS(I^+, \{n\})$ are given with the maximal 
boundary sets $G(I^+, \{n\})$ in the IBMBS of $VS(I^+, I^-)$, we show how to use these sets 
for the classification using theorem 34 from [11]. 

Theorem 34. $(\forall i \in I)\big((\forall c \in VS(I^+, I^-))\neg M(c, i) \leftrightarrow (\forall g \in G(I^+, I^-))\neg M(g, i)\big)$. 

Theorems 33 and 34 imply corollary 35 below. Corollary 35 states that if an 
admissible concept language has the property IP, then none of the descriptions 
of a version space $VS(I^+, I^-)$ covers an instance $i \in I$ if and only if there exists 
a set $G(I^+, \{n\})$ of which all the descriptions do not cover the instance $i$. 

Corollary 35. Consider an admissible concept language $L_c$ such that the 
property IP holds. If the set $I^-$ is nonempty, then: 

$$(\forall i \in I)\big((\forall c \in VS(I^+, I^-))\neg M(c, i) \leftrightarrow (\exists n \in I^-)(\forall g \in G(I^+, \{n\}))\neg M(g, i)\big).$$ 

The instance-classification algorithm of the IBMBS is shown in figure 4. 
Given a nonempty version space $VS(I^+, I^-)$, it classifies an instance $i \in I$ in 
two steps. In the first step the algorithm forms the IBMBS of the version space 
$VS(I^+, I^- \cup \{i\})$ using the instance-addition algorithm applied on the IBMBS of 
$VS(I^+, I^-)$ with the instance $i$ labeled as negative. If $VS(I^+, I^- \cup \{i\})$ is empty, 
by theorem 31 all the descriptions in $VS(I^+, I^-)$ cover the instance. Hence, 
the instance $i$ is positive and the algorithm returns “+”. If $VS(I^+, I^- \cup \{i\})$ 
is not empty, during the second step the algorithm visits the sets $G(I^+, \{n\})$ 
for $n \in I^-$. If none of the elements of one of these sets covers the instance $i$, 
by corollary 35 all the descriptions in $VS(I^+, I^-)$ do not cover the instance. 
Thus, the instance $i$ is negative and the algorithm returns “−”. Otherwise, the 
algorithm returns “?”. 



Instance-Classification Algorithm 

Input: $i$: an instance to be classified. 
       $\langle I^+, \{G(I^+, \{n\})\}_{n \in I^-} \rangle$: IBMBS of $VS(I^+, I^-)$. 

Output: “+” if $(\forall c \in VS(I^+, I^-))M(c, i)$. 
        “−” if $(\forall c \in VS(I^+, I^-))\neg M(c, i)$. 
        “?” otherwise. 

Precondition: $L_c$ is admissible, the property IP holds, and $VS(I^+, I^-) \neq \emptyset$. 

label $i$ as negative 
$\langle I^+, \{G(I^+, \{n\})\}_{n \in I^- \cup \{i\}} \rangle$ = Instance-Addition($i$, $\langle I^+, \{G(I^+, \{n\})\}_{n \in I^-} \rangle$) 
if VS-Collapse($\langle I^+, \{G(I^+, \{n\})\}_{n \in I^- \cup \{i\}} \rangle$) then 
    return “+” 
for $n \in I^-$ do 
    if $(\forall g \in G(I^+, \{n\}))\neg M(g, i)$ then 
        return “−” 
return “?”. 

Fig. 4. The Instance-Classification Algorithm 

Example 5. Let us illustrate the classification algorithm given the IBMBS from 
example 1. Assume that we have to classify instance $i = (1, 1, 1, 1, 1, 1, 0, 0)$. In 
the first step the algorithm updates the IBMBS with the instance $i$ considered 
as negative. The resulting IBMBS coincide with the first IBMBS from example 
2 and represent a nonempty version space. In the second step the algorithm 
determines that all the elements of the maximal boundary sets $G(I^+, \{i_2^-\})$, 
$G(I^+, \{i_3^-\})$, and $G(I^+, \{i_4^-\})$ do cover the instance. Thus, the algorithm returns 
“?”; i.e., the instance classification cannot be determined. □ 
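
The three outcomes of the classification algorithm can be sketched as follows. The code is an illustration for the language of example 1; `classify` and its helpers are hypothetical names, and the "+" branch inlines the negative branch of instance addition followed by the collapse check.

```python
from typing import Dict, Iterable, Set, Tuple

Instance = Tuple[int, ...]
Description = Tuple[object, ...]  # entries are 0, 1, or "?" (assumed encoding)
Family = Dict[Instance, Set[Description]]
Q = "?"


def covers(c: Description, i: Instance) -> bool:
    """Cover relation M(c, i) of example 1."""
    return all(a == Q or a == v for a, v in zip(c, i))


def g_plus(positives: Iterable[Instance], n: Instance) -> Set[Description]:
    """G(I+, {n}) for this language."""
    positives = list(positives)
    size = len(n)
    candidates = {tuple(1 - n[k] if j == k else Q for j in range(size))
                  for k in range(size)}
    return {g for g in candidates if all(covers(g, p) for p in positives)}


def classify(positives: Iterable[Instance], family: Family,
             i: Instance) -> str:
    """Unanimous voting on a nonempty version space (figure 4)."""
    # Step 1: add i as a negative instance and test for collapse (theorem 31).
    extended = dict(family)
    extended[i] = g_plus(positives, i)
    if any(not g_set for g_set in extended.values()):
        return "+"
    # Step 2: look for a boundary set none of whose elements covers i
    # (corollary 35).
    for g_set in family.values():
        if all(not covers(g, i) for g in g_set):
            return "-"
    return "?"


positives = [(1,) * 8]
negatives = [(0, 0, 1, 1, 1, 1, 1, 1), (1, 1, 0, 0, 1, 1, 1, 1),
             (1, 1, 1, 1, 0, 0, 1, 1)]
family = {n: g_plus(positives, n) for n in negatives}

unknown = classify(positives, family, (1, 1, 1, 1, 1, 1, 0, 0))
as_positive = classify(positives, family, (1,) * 8)
as_negative = classify(positives, family, (0, 0, 1, 1, 1, 1, 1, 1))
print(unknown, as_positive, as_negative)  # ? + -
```

The first call reproduces example 5; the other two classify the training instances $i_1^+$ and $i_2^-$ themselves and are included only to exercise the remaining branches.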

The instance classification with the IBMBS can be extended to situations 
when the training instances are noisy. The key idea is to use flexible matching 
between instances and concept descriptions. Below we describe two procedures, 
based on flexible matching, for positive and negative classification, respectively. 

The positive classification procedure, given an instance $i$ to be classified, 
first forms the maximal boundary set $G(\emptyset, \{i\})$. Then for each positive training 
instance $p \in I^+$ it determines the number of descriptions $g \in G(\emptyset, \{i\})$ that do 
not cover the instance. If at least $P_p$ positive training instances are not covered 
by at least $P_g$ descriptions $g \in G(\emptyset, \{i\})$, the instance $i$ is classified as positive, 
where $P_p$ and $P_g$ are parameters of flexible matching. 

The negative classification procedure is similar to that given in [10]. Given 
an instance $i$ to be classified, it determines for each negative training instance 
$n \in I^-$ the number of descriptions $g \in G(I^+, \{n\})$ that do not cover the instance 
$i$. If there exist at least $N_n$ maximal boundary sets $G(I^+, \{n\})$ of which at least 
$N_g$ descriptions do not cover the instance $i$, then the instance is classified as 
negative, where $N_n$ and $N_g$ are parameters of flexible matching. 
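
The negative classification procedure with flexible matching amounts to two nested counts. A minimal sketch under the same assumptions as before (the language of example 1, a dictionary of boundary sets) follows; the function name `flexible_negative` and the parameter names `n_n` and `n_g`, standing for $N_n$ and $N_g$, are chosen for the illustration.

```python
from typing import Dict, Set, Tuple

Instance = Tuple[int, ...]
Description = Tuple[object, ...]  # entries are 0, 1, or "?" (assumed encoding)
Q = "?"


def covers(c: Description, i: Instance) -> bool:
    """Cover relation M(c, i) of example 1."""
    return all(a == Q or a == v for a, v in zip(c, i))


def flexible_negative(family: Dict[Instance, Set[Description]],
                      i: Instance, n_n: int, n_g: int) -> bool:
    """Classify i as negative if at least n_n boundary sets G(I+, {n}) each
    contain at least n_g descriptions that do not cover i."""
    supporting_sets = 0
    for g_set in family.values():
        non_covering = sum(1 for g in g_set if not covers(g, i))
        if non_covering >= n_g:
            supporting_sets += 1
    return supporting_sets >= n_n


# The maximal boundary sets of example 1, written out by hand.
family = {
    (0, 0, 1, 1, 1, 1, 1, 1): {(1, Q, Q, Q, Q, Q, Q, Q), (Q, 1, Q, Q, Q, Q, Q, Q)},
    (1, 1, 0, 0, 1, 1, 1, 1): {(Q, Q, 1, Q, Q, Q, Q, Q), (Q, Q, Q, 1, Q, Q, Q, Q)},
    (1, 1, 1, 1, 0, 0, 1, 1): {(Q, Q, Q, Q, 1, Q, Q, Q), (Q, Q, Q, Q, Q, 1, Q, Q)},
}
query = (0, 1, 0, 1, 1, 1, 1, 1)

lenient = flexible_negative(family, query, n_n=2, n_g=1)
strict = flexible_negative(family, query, n_n=1, n_g=2)
print(lenient, strict)  # True False
```

For the query shown, two boundary sets each contain one non-covering description, so the lenient setting classifies it as negative while the strict setting does not.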

5 Analysis 

This section analyses the IBMBS. Subsection 5.1 gives a worst-case complex- 
ity analysis of the IBMBS and the algorithms presented. Subsection 5.2 uses 
the results of the analysis to determine (1) whether the IBMBS algorithms are 
efficient, and (2) whether the IBMBS are efficiently computable. 

5.1 The Worst-Case Complexity Analysis 

The worst-case complexity analysis is made in terms of the computational char- 
acteristics of admissible concept languages. The characteristics are chosen so 
that they do not depend on the size of the training data. They are given below: 

$\Gamma_n$: the maximal size of the maximal boundary set $G(\emptyset, \{n\})$ for all $n \in I$; 

$t^{\Gamma}_n$: the maximal time for generating the set $G(\emptyset, \{n\})$ for all $n \in I$; 

$\Sigma_p$: the maximal size of the minimal boundary set $S(\{p\}, \emptyset)$ for all $p \in I$; 

$t^{\Sigma}_p$: the maximal time for generating the set $S(\{p\}, \emptyset)$ for all $p \in I$; 

$t_m$: the maximal time of the operator of the relation $M(c, i)$ for all $c \in L_c$, $i \in I$. 

The computational characteristics $\Sigma_p$ and $t^{\Sigma}_p$ are given for the complexity analysis 
of the instance-based minimal boundary sets presented in section 7. 

The condition for the worst-case complexity analysis is that the size of the 
maximal boundary sets $G(I^+, \{n\})$ is equal to the size $\Gamma_n$ for all $n \in I$, $I^+ \subseteq I$. 

Space Complexity 

The worst-case space complexity of the IBMBS is $|I^+|$ plus the worst-case space 
complexity $O(|I^-|\Gamma_n)$ of the G-part. Thus, it is $O(|I^+| + |I^-|\Gamma_n)$. 

Time Complexities 

The Instance-Addition Algorithm. The worst-case time complexity of the 
algorithm part for processing one positive instance is $O(|I^-|\Gamma_n t_m)$. The factor $|I^-|$ 
arises because we have $|I^-|$ maximal boundary sets $G(I^+ \cup \{i\}, \{n\})$. The factor 
$\Gamma_n t_m$ arises because in order to form each maximal boundary set $G(I^+ \cup \{i\}, \{n\})$ 
we test $\Gamma_n$ elements of the set $G(I^+, \{n\})$ whether they cover the instance. 

The worst-case time complexity of the algorithm part for processing one 
negative instance is $O(t^{\Gamma}_n + |I^+|\Gamma_n t_m)$. The term $O(t^{\Gamma}_n)$ arises because the 
maximal boundary set $G(\emptyset, \{i\})$ is generated. The term $O(|I^+|\Gamma_n t_m)$ arises because 
the maximal boundary set $G(I^+, \{i\})$ is generated from elements of the set 
$G(\emptyset, \{i\})$ that are tested to cover all the positive instances in the set $I^+$. 

The Instance-Retraction Algorithm. The worst-case time complexity of the 
algorithm part for processing one positive instance is the sum $O(t^{\Gamma}_n + |I^+|\Gamma_n t_m) + 
O(|I^-|\Gamma_n t_m)$. The term $O(t^{\Gamma}_n + |I^+|\Gamma_n t_m)$ is the worst-case time complexity 
for generating the maximal boundary set $G(I^+ \setminus \{i\}, \{i\})$. (The sub-term $t^{\Gamma}_n$ 
arises because the set $G(\emptyset, \{i\})$ is generated. The sub-term $|I^+|\Gamma_n t_m$ arises 
because the size of the set $G(\emptyset, \{i\})$ is $\Gamma_n$ and each element of $G(\emptyset, \{i\})$ is checked 
whether it covers all the positive instances in the set $I^+ \setminus \{i\}$.) The second 
term $O(|I^-|\Gamma_n t_m)$ is the time complexity for constructing the maximal 
boundary sets $G(I^+ \setminus \{i\}, \{n\})$ for all $n \in I^-$. (The factor $|I^-|$ arises because we 
have $|I^-|$ sets $G(I^+ \setminus \{i\}, \{n\})$. The factor $\Gamma_n t_m$ arises because formation of 
each set $G(I^+ \setminus \{i\}, \{n\})$ requires $\Gamma_n$ elements of the set $G(I^+ \setminus \{i\}, \{i\})$ to be 
tested not to cover the corresponding instance $n$.) Thus, the worst-case time 
complexity of this part of the algorithm is: 

$$O(t^{\Gamma}_n + |I^+|\Gamma_n t_m) + O(|I^-|\Gamma_n t_m) = O(t^{\Gamma}_n + (|I^+| + |I^-|)\Gamma_n t_m).$$ 

The worst-case time complexity of the algorithm part for processing one 
negative instance is $O(1)$ because only its maximal boundary set is removed. 

The Algorithm for Version-Space Collapse. The worst-case time complexity of 
the algorithm is $O(|I^-|)$. The term $|I^-|$ arises because in the worst case $|I^-|$ 
maximal boundary sets $G(I^+, \{n\})$ are checked whether they are empty. 

The Instance-Classification Algorithm. The instance-classification algorithm 
consists of two parts. The first part is the positive instance-classification part. 
Its worst-case time complexity is the sum $O(t^{\Gamma}_n + |I^+|\Gamma_n t_m) + O(|I^-|)$. (The 
first term is the worst-case time complexity of the instance-addition algorithm 
given the instance $i$ to be classified labeled as negative. The second term is 
the worst-case time complexity of the algorithm for version-space collapse.) 
The second part is the negative instance-classification part. Its worst-case time 
complexity is $O(|I^-|\Gamma_n t_m)$. The factor $|I^-|$ arises because we have $|I^-|$ 
maximal boundary sets $G(I^+, \{n\})$. The factor $\Gamma_n t_m$ arises because the elements of 
each maximal boundary set $G(I^+, \{n\})$ are tested not to cover the instance $i$. 
Thus, the worst-case time complexity of the instance-classification algorithm is: 

$$O(|I^-|) + O(t^{\Gamma}_n + |I^+|\Gamma_n t_m) + O(|I^-|) + O(|I^-|\Gamma_n t_m) = O(t^{\Gamma}_n + (|I^+| + |I^-|)\Gamma_n t_m).$$ 

The IBMBS complexities are summarised in table 1. 



Table 1. Worst-Case Complexities of the IBMBS and their Algorithms 

Space: $O(|I^+| + |I^-|\Gamma_n)$ 

Time 

Instance-Addition Algorithm ($\oplus$ instance): $O(|I^-|\Gamma_n t_m)$ 

Instance-Addition Algorithm ($\ominus$ instance): $O(t^{\Gamma}_n + |I^+|\Gamma_n t_m)$ 

Instance-Retraction Algorithm ($\oplus$ instance): $O(t^{\Gamma}_n + (|I^+| + |I^-|)\Gamma_n t_m)$ 

Instance-Retraction Algorithm ($\ominus$ instance): $O(1)$ 

Version-Space Collapse Algorithm: $O(|I^-|)$ 

Instance-Classification Algorithm: $O(t^{\Gamma}_n + (|I^+| + |I^-|)\Gamma_n t_m)$ 
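
As a rough illustration of the space bound, consider the language of example 1. Assuming that $G(\emptyset, \{n\})$ there consists of the eight descriptions that fix a single attribute to the value opposite to that of $n$ (an observation about this particular language, not a statement from the paper), the maximal size of $G(\emptyset, \{n\})$ is 8. With one positive and three negative training instances the worst-case bound evaluates to:

$$|I^+| + |I^-|\,\Gamma_n = 1 + 3 \cdot 8 = 25,$$

whereas the IBMBS actually listed in example 1 store only

$$|I^+| + \sum_{n \in I^-} |G(I^+, \{n\})| = 1 + 3 \cdot 2 = 7$$

elements, because the positive instance filters six of the eight candidates out of each maximal boundary set.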



5.2 IBMBS and Efficiency 

To determine whether the algorithms of the IBMBS are efficient we employ a rule 
proposed in [2]: an algorithm of a version-space representation is efficient for a 
concept language if the worst-case time complexity of the algorithm is polynomial 
in the computational features of the language and the size of the input. From 
the previous subsection we know that the worst-case time complexities of the 
IBMBS algorithms are polynomial in the computational characteristics $\Gamma_n$, $t^{\Gamma}_n$, 
$t_m$, and the sizes $|I^+|$ and $|I^-|$. In this context we note that the upper bound of 
the size of the input of the algorithms is the size of the IBMBS; i.e., $|I^+|$ plus 
$|I^-|\Gamma_n$. Thus, we conclude that the IBMBS algorithms for instance addition, 
instance retraction, version-space collapse and instance classification are efficient 
for admissible concept languages. In addition, we emphasise that the 
instance-retraction algorithm does not recompute the IBMBS. 

To determine whether the IBMBS are efficiently computable we employ a 
second rule proposed in [2]: a version-space representation is efficiently 
computable for a concept language if in the worst case its size is polynomial in the 
computational features of the language and the sizes of the training sets, and 
the representation has an efficient instance-addition algorithm. From the 
previous subsection we know that the worst-case space complexity of the IBMBS is 
polynomial in the computational characteristic $\Gamma_n$ and the sizes $|I^+|$ and $|I^-|$. 
Since the IBMBS instance-addition algorithm is efficient we conclude that the 
IBMBS are efficiently computable for admissible concept languages. 



6 Usefulness of the IBMBS 

This section evaluates the usefulness of the IBMBS. For this purpose we 
summarise the IBMBS employing the characteristics of a useful version-space 
representation (definition 8). The characteristics are compactness, finiteness, efficient 
computability, and efficiency of the IBMBS algorithms. 

We showed that the IBMBS are a correct and compact version-space 
representation for admissible concept languages (section 3). They are finite if the 
training sets are finite and the maximal boundary set $G(\emptyset, \{n\})$ is finite for all 
$n \in I$. The IBMBS are efficiently computable and have efficient algorithms for 
instance addition, instance retraction, version-space collapse and instance 
classification for admissible concept languages (sections 4 and 5). The only restrictions 
are that the instance-retraction algorithm requires the property G while the 
algorithms for version-space collapse and instance classification require the property 
IP on the concept language used. 

From this summary we conclude according to definition 8 that the IBMBS are 
a useful version-space representation for the class of admissible concept languages 
if the training sets are finite, the maximal boundary set $G(\emptyset, \{n\})$ is finite for 
all $n \in I$, and the property G as well as the property IP hold. 

7 Instance-Based Minimal Boundary Sets 

Instance-based minimal boundary sets (IBmBS) and their algorithms can be derived
by duality from the previous sections*. Therefore, we refrain from providing
details. The IBmBS complexities are given in table 2.

Table 2. Worst-Case Complexities of the IBmBS and their Algorithms

Space: $O(|I^+|\Sigma_p + |I^-|)$

Time

Instance-Addition Algorithm ($\oplus$ instance): $O(t_p + |I^-|\Sigma_p t_m)$

Instance-Addition Algorithm ($\ominus$ instance): $O(|I^+|\Sigma_p t_m)$

Instance-Retraction Algorithm ($\oplus$ instance): $O(1)$

Instance-Retraction Algorithm ($\ominus$ instance): $O(t_p + (|I^+| + |I^-|)\Sigma_p t_m)$

Version-Space Collapse Algorithm: $O(|I^+|)$

Instance-Classification Algorithm: $O(t_p + (|I^+| + |I^-|)\Sigma_p t_m)$



8 Comparison with Relevant Work 

Below we compare the IBMBS and the IBmBS with the training-instance rep- 
resentation [3] and the instance-based boundary-set representation [11,12], i.e., 
with version-space representations that are efficient for instance retraction. The 
comparison is made using the characteristics of useful version-space representa- 
tions (definition 8). 

The training-instance representation is a version-space representation that
consists of the sets of positive and negative training instances. By definition the
representation is compact. Obviously, the conditions for finiteness of the
training-instance representation are a subset of the conditions for finiteness of the
IBMBS (IBmBS). An analogous conclusion can be derived when the representations are
compared with respect to efficient computability. The training-instance
representation allows much more efficient algorithms for instance addition and instance
retraction. This advantage comes with a price: the instance-classification
algorithm determines the classification of each instance by a search in the concept
language using all the training data and the instance [2]. Thus, the
training-instance representation has only a theoretical value. This contrasts with the
IBMBS and the IBmBS: their instance-classification algorithms are not based
on search and this is one of the factors of their usefulness.

* We note that the dual of the property G is the property S, and the dual of the
intersection-preserving property (IP) is the union-preserving property (UP) [11].

The instance-based boundary-set representation (IBBS) is a useful version- 
space representation that consists of a family of minimal boundary sets indexed 
by positive training instances and a family of maximal boundary sets indexed 
by negative training instances [11,12]. The representation is correct and com- 
pact for admissible concept languages. It is possible to prove that the conditions 
for finiteness of the IBBS are a superset of the conditions for finiteness of the 
IBMBS (IBmBS). In order to compare the representations in terms of efficiency 
we examine the IBBS worst-case complexities given in table 3. An analysis of 
these complexities shows that each of them is equal to the sum of the correspond- 
ing complexities of the IBMBS and IBmBS. Thus, the IBMBS and the IBmBS 
have two advantages: (1) they are more efficiently computable than the IBBS, 
and (2) the IBMBS and the IBmBS algorithms for instance addition, instance 
retraction, version-space collapse, and instance classification are more efficient. 
Moreover, the applicability of the IBBS instance-retraction algorithm is more 
restricted: the algorithm can be applied only if both properties S and G hold. 

Table 3. Worst-Case Complexities of the IBBS and their Algorithms

Space: $O(|I^+|\Sigma_p + |I^-|\Gamma_n)$

Time

Instance-Addition Algorithm ($\oplus$ instance): $O(t_p + |I^-|(\Sigma_p + \Gamma_n)t_m)$

Instance-Addition Algorithm ($\ominus$ instance): $O(t_n + |I^+|(\Sigma_p + \Gamma_n)t_m)$

Instance-Retraction Algorithm ($\oplus$ instance): $O(t_n + (|I^+| + |I^-|)\Gamma_n t_m)$

Instance-Retraction Algorithm ($\ominus$ instance): $O(t_p + (|I^+| + |I^-|)\Sigma_p t_m)$

Version-Space Collapse Algorithm: $O(|I^+| + |I^-|)$

Instance-Classification Algorithm: $O(t_p + t_n + (|I^+| + |I^-|)(\Sigma_p + \Gamma_n)t_m)$



9 Conclusion 

This chapter introduced a family of useful version-space representations called 
one-sided instance-based boundary sets (IBMBS and IBmBS). We showed that 
these representations are correct and compact for the class of admissible concept 
languages. This allowed us to derive the conditions for finiteness. In addition, 
we demonstrated that the one-sided instance-based boundary sets are efficiently
computable and have efficient algorithms for instance addition, instance retraction,
version-space collapse and instance classification for the class of admissible
concept languages. We compared the one-sided instance-based boundary sets 
with other existing version-space representations that are efficient for instance 
retraction. From the comparison we conclude that the one-sided instance-based 
boundary sets are at the moment the most efficient useful version-space repre- 
sentations for instance retraction. So, our research question from section 1 has 
been answered positively.



Abstract. Events are used to monitor many types of processes in several
technical domains. Computers and efficient electronic communication
networks make it very easy to increase the accuracy and amount of
logged details. As logs grow in size, collecting and analysing them
becomes harder all the time. Frequent episodes offer one
possible method to structure and find information hidden in logs.
Unfortunately, as events reflecting simultaneous independent processes are
stored at central monitoring points, signs of several unrelated phenomena
get mixed with each other. This causes the algorithm that searches for
frequent episodes to produce accidental and irrelevant results. As a solution
to this problem, we introduce here a notion of domain constraints that
are based on distance measures, which can be defined in terms of the domain
structure and the taxonomies used. We also show how these constraints can
be used to prune irrelevant event combinations.



1 Introduction 

Episode rules are a powerful way to search for and describe patterns in event
sequences. Even though the algorithms are fast, they easily find too much new
information. The resulting problem, sometimes referred to as second order data
mining, has been recognized quite early (e.g. [8]). In the case of telecommunication
network alarm and system log analysis, for example, two different kinds of
problems should be emphasized. In the pool of discovered rules, among the
interesting ones, there are (1) proper rules that are of no interest and (2) rules
that have been generated between items or sets of items that cannot have any
interdependence with each other.

In this paper we concentrate on the second problem: how to minimize the ef- 
fect of accidental event combinations in analysis, especially in telecommunication 
network log analysis. The telecommunication networks produce large amounts 
of different types of events that are logged in central monitoring points. These 
events include different log entries reflecting normal operation of the network as 
well as alarm information about faults and problems that occur. 

In the network there are always plenty of independent processes going
on. These processes emit alarms when they are disturbed by faults. It often
happens that many independent processes are simultaneously affected by a fault
and they all start to alarm, not necessarily about the fault itself, but about its
secondary reflections. Thus, the alarms and log entries that are sent actually carry
second-hand information about the incident. They do not necessarily identify the
primary fault at all.

R. Meo et al. (Eds.): Database Support for Data Mining Applications, LNAI 2682, pp. 289-305, 2004.
© Springer-Verlag Berlin Heidelberg 2004

K. Hätönen and M. Klemettinen

The alarms that network processes emit are collected at centralized monitoring
points. This makes the analysis even more difficult, because at each monitoring
point the symptoms and reflections of separate problems are merged
into one information flow. The combined flow also contains noisy information
caused by natural phenomena like thunderstorms or by normal maintenance
operations.

A starting point for a network analyst in a fault condition is always local- 
ization and isolation of the fault, i.e., finding the area where the problem is 
located and identification of all network elements that are affected by the fault. 
Localization and isolation is based on the assumption that it is probable that 
the fault itself is local although its reflections are widespread. In this situation 
alarms coming from the same network element or its direct neighbors are related 
to one reflection of the fault. After the localization has been done it is easier to 
do the actual identification of the fault. 

Episode and association rule based techniques have been used in semi- 
automatic knowledge acquisition from alarm data in order to collect the re- 
quired knowledge for knowledge based systems like alarm correlators [5,7,6]. 
Episode rules [9,10] describe temporal proximity and temporal ordering of re- 
current combinations of alarms in a given alarm database. Association rules [1,2], 
in turn, describe the properties of individual alarms without taking the temporal 
relationships of the alarms into account. Given such rules holding in an alarm 
database, a fault management expert is able to verify whether the rules are useful 
or not. Some of the rules may reflect known causal connections, and some may 
be irrelevant, while some rules give new insight to the behavior of the network 
elements. Selected rules can be used as a basis of correlation patterns for alarm 
correlation systems. 

Based on our experience, simple-minded use of discovery algorithms poses 
problems with the amount of generated rules and their relevance. In the KDD 
process [3], it is often reasonable or even necessary to constrain the discovery 
using background knowledge. If no constraints are applied, the discovered re- 
sult set of, say, episode rules might become huge and contain mostly trivial, 
uninteresting or even impossible rules. 

The basic methods for finding episodes use event distances in time but ignore 
all the other knowledge about the domain. There is, however, plenty of other use- 
ful domain knowledge, e.g., topological information, available from a technical 
domain like telecommunications networks. Probably the most common informa- 
tion is some kind of part-of hierarchy between domain objects. Also different 
taxonomies and definitions of control hierarchy and material flows between objects
are usually available. We suggest that this background knowledge can be
used as a basis for domain specific distance measures. These distance measures
can then be used to prune out accidental event occurrences that are caused by
simultaneous but independent phenomena in the network.

This approach of incorporating domain knowledge into the mining
process by using distance constraints is general. It is not limited to the task
of finding correlating event sequences but can be applied to mining of different
types of patterns. It can also be used in mining market basket type of data,
especially if there are taxonomies available.

There are some earlier studies that discuss the problem of incorporating constraints
into data mining algorithms. Most of the approaches (e.g. [11,4]) concentrate
on selecting interesting items or item combinations outside the analysis
process. The analysis is then focused so that only those patterns describing the
selected items are searched for. Zaki [12] studies the relationships of single events
in multiple event sequences. He introduces templates that are used to restrict
supporting occurrences only to those that fulfill a given time interval restriction.

In this article, we introduce methods to apply domain knowledge in searching 
for episodes. First, in Section 2 we present different approaches that take the 
structure of the domain into consideration. In Section 3 we give empirical evi- 
dence from real-life telecommunication alarm datasets supporting the use of the 
presented methods. In Section 4 we generalize the notion of distances and show 
how several distance constraints can be combined. Finally, Section 5 contains 
general discussion, conclusions, and issues for future work. 



2 Application of Domain Knowledge 

Episode rules and episodes are a modification of the concept of association rules 
and frequent sets, applied to sequential data. In this paper we adopt the definition 
given by Mannila and Toivonen [9] and concentrate on reducing the number of 
invalid episodes, which has a straightforward impact on the number and quality 
of resulting episode rules. We have also adopted the basic algorithm they present. 

2.1 Data 

The data set S used in the episode calculation is usually a set of tuples of the 
form (t, E), where t is an event occurrence time and E is an event type. An event 
type can be either a type number or some other field of the event, e.g., a message 
text attached to it. Episodes are calculated using these event types. 

In reality, this data provides only a small fraction of the available information 
about the events and the underlying network. For instance, by looking at the 
network configuration — both the topological structure and the taxonomies of 
event types — it is possible to prune impossible relationships between events, 
i.e., to reduce the number of episodes found. 

To make it possible to use this type of knowledge about the structure of the
domain, the data format presented above has to be expanded by adding
information about the source of the event. Thus, the data set will be a collection
of triples of the form (t, s, E), where s is a sender or source of the event. For
example, in the alarm database tests presented below, events are of the form
(time, NEId, alarm type), where time gives the beginning time of an alarm,
NEId the unique id of the network element that has sent the alarm, and alarm
type the type of the alarm.
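
The extended event format can be illustrated with a small sketch. The names `Event` and `extend_with_source` are hypothetical and not part of the original work, and integer time stamps are an assumption made for brevity.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """One log entry (t, s, E): occurrence time, source, event type."""
    time: int     # occurrence time t (integer ticks assumed here)
    source: str   # sender s, e.g. a network element id (NEId)
    etype: str    # event type E, e.g. an alarm type


def extend_with_source(pairs, sources):
    """Turn (t, E) pairs into (t, s, E) triples, given one source per pair."""
    return [Event(t, s, e) for (t, e), s in zip(pairs, sources)]


# Two alarms of types A and C, both sent by a node called N1.
events = extend_with_source([(2, "A"), (9, "C")], ["N1", "N1"])
```

Keeping the source as a separate field is what later allows a distance function to be evaluated on pairs of events.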



2.2 Domain Model 

In addition to the data, it is often possible to have a model $M = (N, F, R, D)$
about the domain and especially about its structure. Here $N = \{ne_0, \ldots, ne_n\}$
is a set of objects of the domain, $F$ is a set of facts defining the structure of
the domain, i.e., the relationships that connect the objects together, $R$ is a set
of relations that define how it is possible to deduce new relationships, and $D$
is a set of distance functions $d_i(source(e_j), source(e_k)) = d_i(ne_n, ne_m)$, where
$e_j, e_k \in S$ and $ne_n, ne_m \in N$. These distance functions return the distances of
sources of two events in the defined structure.

In the telecommunication network data sets used in the experiments of this article,
for example, the model $M$ consists of $N$, a set of network elements;
a configuration table $F$, a binary relation that defines the control relations
between network elements; a set of relations $R$ that show how different
ancestor or sibling relations can be deduced for a node; and a distance function
$d_0(source(e_i), source(e_j))$ that returns an integer corresponding to the number
of steps that must be taken in the control hierarchy in order to travel from
the network element that sent the event $e_i$ to the one that sent the event $e_j$.

The model $M$ can be used to restrict the generation of $P$, a set of frequent
episodes. For this, it is possible to use a domain constraint $C = c_0, \ldots, c_j$ to
give maximum thresholds for the corresponding distance functions $d_i()$. During the
computation of episode candidate supports, a support will be increased only if
$d_i(source(e_j), source(e_k)) \le c_i$ holds for all required distances $d_i$ and for all
pairs of events $e_j, e_k$ in a possible candidate occurrence.
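
The validity test for one candidate occurrence can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `is_proper_occurrence`, the argument layout, and the toy distance `d0` are assumptions, and the comparison is taken to be non-strict so that a threshold of 0 admits events from the same source.

```python
from itertools import combinations


def is_proper_occurrence(sources, distances, thresholds):
    """Return True if every pair of sources satisfies d_i(a, b) <= c_i for all i.

    sources    -- sources of the events in one candidate occurrence
    distances  -- list of distance functions d_i(a, b)
    thresholds -- list of maximum thresholds c_i, aligned with `distances`
    """
    return all(
        d(a, b) <= c
        for a, b in combinations(sources, 2)
        for d, c in zip(distances, thresholds)
    )


def d0(a, b):
    """Hypothetical distance: 0 for the same source, 3 steps otherwise."""
    return 0 if a == b else 3
```

A support counter would then be incremented only for occurrences for which this test returns True.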

On the left in Figure 1 there are two nodes that send alarms, which are
collected to the same database. However, the nodes are situated far from each
other in the network topology; i.e., the node $N_1$ is not directly connected to the
node $N_2$ but there are several steps between them. If the distance function $d_0()$
has been defined as the number of node-to-node steps between network nodes,
it returns a value that is larger than one. In a real-world situation it is, for
example, most likely that if $d_i(N_1, N_2)$ grows much larger than 1, then the
probability that events sent by either of the nodes correlate with
events from the other one decreases¹.

¹ Of course, exceptions exist, but they are usually quite rare. And if this phenomenon
is very rare, it would be pruned anyhow because of the frequency threshold.

A list of event types and their occurrence times is given on the right in
Figure 1. If the frequency threshold is 2 and the window in which the events can
correlate is of length 3, then all event types, i.e., A, B, C, D, are frequent. Thus,
there are six candidates (AB, AC, AD, BC, BD, and CD) that can be frequent.

[Figure 1: alarm rows as extracted, A B A B C and D C D C C C C C C, over the time axis 1 2 3 4 5 6 7 8 9]

Fig. 1. A simple network with two nodes not directly connected to each other
(on the left) and a set of sent alarms (on the right)

If the distance constraint $c_0$ for distance $d_0()$ is not required for occurrence
validation, then four candidates appear to be frequent, namely AB, AC, BC,
and CD. The other two occur only once in the data set. The distance constraint
for $d_0$ can be enforced by setting $c_0 = 0$, i.e., by demanding that to be valid, all
events of an occurrence have to come from the same source. Thus, the number
of frequent sets decreases to two. Then only candidates AB and CD occur twice
inside a given time window and events in all occurrences of them are sent by the
same node.




2.3 Implementation Alternatives 

A straightforward way to apply distance constraints would be to prevent accidental
occurrences of a candidate from increasing the candidate's support. This
can be done during the frequent episode computation in the algorithm that
we have adopted [9]. When the supports for candidates are computed, we check
every possible occurrence of a candidate against the constraint to see whether
the occurrence is appropriate. If the occurrence is valid, then we increase the
corresponding support counter.

There are, however, some drawbacks in this approach. Probably the most
serious one is that in the worst case it might take exponential time to check all
possible instances of all possible episode candidates, as has been discussed in [9].

Let us consider, for example, the alarm flow given in Figure 1 and there
the candidate AC². There are eight possible occurrences of the candidate:
$\{(A_{1.2}, C_{2.2}), (A_{1.2}, C_{2.4}), (A_{1.7}, C_{2.5}), (A_{1.7}, C_{2.6}), (A_{1.7}, C_{2.7}), (A_{1.7}, C_{2.8}), (A_{1.7}, C_{1.9}), (A_{1.7}, C_{2.9})\}$.
Seven out of these are improper with respect to the defined
distance constraint $d_0(e_i, e_j) \le c_0$, where $c_0 = 1$. The only proper occurrence is
$(A_{1.7}, C_{1.9})$, where $d_0(N_1, N_1) = 0$. Altogether, there are 21 possible occurrences
of candidates to be checked against the domain constraint and out of these only
6 are proper ones.
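
The count for the candidate AC can be reproduced with a short sketch. Several things here are assumptions rather than facts from the text: the event times are read off the example (A from the first node at times 2 and 7, C from the first node at time 9, C from the second node at times 2 and 4 to 9), a window of length 3 is interpreted as a time difference below 3, and the distance of 3 steps between the two nodes is a made-up value standing in for "several steps".

```python
from itertools import product

# (time, source) pairs for event types A and C in the two-node example.
A_EVENTS = [(2, "N1"), (7, "N1")]
C_EVENTS = [(t, "N2") for t in (2, 4, 5, 6, 7, 8, 9)] + [(9, "N1")]


def d0(a, b):
    """Assumed distance: 0 for the same node, 3 steps between N1 and N2."""
    return 0 if a == b else 3


def occurrences(xs, ys, window):
    """All pairs whose times fit inside one window of the given length."""
    return [(x, y) for x, y in product(xs, ys) if abs(x[0] - y[0]) < window]


def proper(pairs, c0):
    """Keep only the pairs whose sources satisfy d0 <= c0."""
    return [(x, y) for x, y in pairs if d0(x[1], y[1]) <= c0]


all_ac = occurrences(A_EVENTS, C_EVENTS, window=3)
proper_ac = proper(all_ac, c0=1)
print(len(all_ac), proper_ac)
```

Under these assumptions the sketch finds eight possible occurrences, of which only the pair sent entirely by the first node passes the constraint.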

In the straightforward approach explained above, there are still plenty of
unnecessarily composed impossible candidates. The events in these faulty candidates
are introduced, for example, by sources that emit different types of events
at the same time but are located so that the distance constraints used would not
allow events emitted by them to support any candidate episodes. All these candidates
have to be checked against the data. A lot of work can be avoided if it
is possible to find nonoverlapping subsets from the data.



² In the following we denote an alarm A sent by node $N_1$ at time moment 2 by using
the notation $A_{1.2}$. All other alarms are denoted correspondingly.

Definition 1. Two subsets $S_k$ and $S_l$ of the dataset $S$ are nonoverlapping iff

1. there is no such event $e$ that $e \in S_k$ and $e \in S_l$ and

2. there are no such events $e_i \in S_k$ and $e_j \in S_l$ for which
$d_m(source(e_i), source(e_j)) \le c_m$ for any distance function $d_m \in D$ and distance constraint
$c_m \in C$.

In other words, nonoverlapping subsets of the data do not contain the same 
events and it is impossible for any reason that the events in separate nonover- 
lapping subsets could interfere with each other, i.e., be included in proper occur- 
rences of episode candidates. There can, however, be pairs of events in a subset, 
which together can not support a candidate episode. 

Nonoverlapping subsets of events can be found, if the distance constraints, 
which check the candidate occurrences, can be used to introduce nonoverlapping 
sets of sources. Thus, these event subsets can be separated from each other and 
frequent episode sets can be computed from each of them and summed up to a 
set of frequent episodes of the combined data set. The separation can be done 
with a straightforward algorithm (see Figure 2) in linear time with respect to the 
data size but in doubled space. The worst case is that there are separate network 
elements sending each of the alarms. Normally, however, this is not the case. 

1. $SRC = \bigcup source(e), e \in S$

2. Divide $SRC$ into source subsets $SRC_k$, such that $\forall ne_i, ne_j \in SRC_k : d_m(ne_i, ne_j) \le c_m$,
for all $d_m \in D$ and $c_m \in C$, and there is no such network element $ne_i \in SRC_k$
that $ne_i \in SRC_l$, $l \ne k$.

3. forall source subsets $SRC_k$ do

4.     Initialize empty partition $S_{SRC_k}$

5. forall events $e \in S$ do

6.     Include $e$ to $S_{subset(source(e))}$

7. return multiset of $S_{SRC_k}$

Fig. 2. Algorithm for finding nonoverlapping subsets 
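
One way to realise the separation is sketched below. It is an interpretation, not the authors' code: sources are grouped into connected components of the relation "distance within the threshold" (using a small union-find), so that no two sources from different subsets can satisfy the constraint, as Definition 1 requires. The function name, the single distance function, and the four-node step counts are hypothetical. The grouping step is quadratic in the number of sources, while the events themselves are distributed in one pass.

```python
from collections import defaultdict


def partition_events(events, distance, c):
    """Split (time, source, etype) events into nonoverlapping subsets.

    Sources are grouped into connected components of the relation
    distance(a, b) <= c; each event then goes to its source's component.
    """
    sources = sorted({s for _, s, _ in events})
    parent = {s: s for s in sources}          # union-find forest

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]     # path halving
            s = parent[s]
        return s

    for i, a in enumerate(sources):
        for b in sources[i + 1:]:
            if distance(a, b) <= c:
                parent[find(a)] = find(b)

    subsets = defaultdict(list)
    for e in events:                          # single pass over the data
        subsets[find(e[1])].append(e)
    return list(subsets.values())


# Hypothetical step counts for a four-node network; N1 is far from the rest.
STEPS = {("N2", "N3"): 1, ("N3", "N4"): 1, ("N2", "N4"): 2}


def d0(a, b):
    if a == b:
        return 0
    return STEPS.get((a, b), STEPS.get((b, a), 3))


events = [(1, "N2", "D"), (2, "N1", "A"), (2, "N3", "E"), (3, "N4", "F")]
parts = partition_events(events, d0, c=1)
```

Frequent episodes could then be computed separately for each returned subset.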

The partitioning approach explained above can not be applied if there are no 
ways to separate nonoverlapping subsets from the data. 

A nonoverlapping subset $S_k$ is said to be dense iff for every pair of events
$e_i \in S_k$ and $e_j \in S_k$, $d_m(source(e_i), source(e_j)) \le c_m$ for all distance functions
$d_m \in D$ and distance constraints $c_m \in C$. In such a case, all event combinations
may support a candidate episode. This is an optimal case of partitioning, in which
there is no need to check candidate occurrences against distance constraints.

Let us add two new nodes to our network. The nodes $N_3$ and $N_4$ are added
as shown in Figure 3. The distance function is the same as above: $d_0(ne_i, ne_j)$
returns the number of steps from the node $ne_i$ to the node $ne_j$. The domain
constraint sets the allowed number of steps to one, i.e., $c_0 = 1$. Thus, the constraint
requires that in order to be proper, an occurrence of a candidate has to
contain only such events that are sent by neighboring nodes of the network.
For example, events sent by $N_2$ and $N_4$ cannot be in the same proper candidate
occurrence.




Domain Structures in Filtering Irrelevant Frequent Patterns




Fig. 3. A network with four nodes. Solid lines represent direct connections and
the dashed line represents a connection that is not direct

It is possible to separate two nonoverlapping subsets of data from the alarm
set. This is possible since events sent by $N_1$ cannot interfere with events sent
by any other node in the network. The other part of the network contains a
nonoverlapping subset of sources that cannot be further divided into nonoverlapping
dense subsets. The events sent by node $N_3$ can interfere with events sent
by either of the nodes $N_2$ and $N_4$, although events sent by $N_2$ and $N_4$ cannot
interfere with each other since they are not direct neighbors. The subsets are
shown in Figure 4.



[Figure 4: first subset, one node row with alarms A B A B C; second subset, three node rows with alarms D C D C C C C C C, E E, and F F; time axes 1 2 3 4 5 6 7 8 9]

Fig. 4. Alarms sent by nodes in two nonoverlapping subsets

Separation to nonoverlapping subsets of data can be used to improve the 
effectiveness of the straightforward approach. When the separation has been 
done, each of the partitions is smaller than the original data set. Thus, there 
are fewer occurrences of possible candidates and the evaluation times of the 
restrictive constraints are reduced. 

While computing the frequencies of the episodes from the data partition that
includes only events sent by node $N_1$, there is no need to make any kind of
domain constraint evaluation. This is because all three requirements set for a
nonoverlapping dense subset hold in this partition. In the other partition the
domain constraint has to be evaluated. If we evaluate it during the frequent set
computation, the constraint has to be evaluated with 17 occurrences out of which
7 occurrences are proper. If it, on the other hand, is evaluated afterwards with
occurrences of only those episodes that are frequent if the domain constraint is
omitted, then there will be 14 occurrences to be evaluated. This is because out
of six candidates (CD, CE, CF, DE, DF, EF) only four (CD, CE, CF, EF) are
frequent without the domain constraint.



3 Experiments with Domain Knowledge

3.1 A Structure of the Network 

A GSM network is organised as functional groups of network elements. In each 
group, there is one base station controller (BSC) that controls the internal behav- 
ior of the group and interface to other groups. The group has an internal tree form 
structure as is shown in Figure 5. These groups are called BSC groups below. 



It is more probable that the alarms emitted by the network elements controlled
by the same BSC are related than those alarms coming from separate
groups. Therefore the distance measure $d_0$ that is used with the
following alarm data sets is defined as the number of arcs in the control hierarchy
that are between the network elements sending the alarms. For example,
$d_0(NE_{jkl}, BSC_j) = d_0(NE_{jklp}, NE_{jkln}) = 2$ and $d_0(NE_{jkl}, BSC_i) = 3$.
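
A distance of this kind can be computed with a breadth-first search over the control relations. The sketch below is only an illustration: the element names and the links between them are hypothetical, chosen so that the resulting arc counts match the example values of 2 and 3, and the assumption that the two controllers are linked directly is not taken from the source.

```python
from collections import deque

# Hypothetical control hierarchy; element names and links are assumptions.
LINKS = [
    ("BSC_i", "BSC_j"),
    ("BSC_j", "NE_jk"),
    ("NE_jk", "NE_jkl"),
    ("NE_jkl", "NE_jklp"),
    ("NE_jkl", "NE_jkln"),
]


def build_graph(links):
    """Undirected adjacency sets built from control-relation pairs."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def d0(graph, src, dst):
    """Number of arcs on the shortest path between two elements (BFS)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path: the elements are not connected


GRAPH = build_graph(LINKS)
```

Returning `None` for unconnected elements leaves it to the caller to decide whether such pairs should count as infinitely distant.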

3.2 Telecom Event and Alarm Data Sets 

The first data set, A, contains 55690 events with 295 different event types. They
were emitted by 19 BSC groups located around a large geographical area. The
period that is covered by the data set is 15 days. Each of the groups has an internal
structure like the one shown in Figure 5. Events are sent by network elements
within the groups. The distance measure is defined as explained above.

In the search phase, we used time and the distance measure $d_0()$ to constrain
episode occurrences. Only those episode occurrences were searched for that
came within one hour from network elements within distance $c_0$, where
$c_0$ = 0, 1, 2, 3, 4 and 5.

The second data set, B, contains 25000 alarms consisting of 115 types and 
emitted by 28 BSC groups during 38 days. The distance measure was the same 
as with the first data set. 

3.3 Restrictive Power 

The restrictive power of the control hierarchy distance was tested by setting the
window size of the episode algorithm to a large value and by changing the limiting
distance that was used as domain constraint $c_0$. The test was run separately with
the first two data sets, A and B. With the first data set, the support was set to
1 and with the second to 2. For the first set, only the episodes of size two were
composed.

Fig. 5. A network with two BSC groups



Table 1. Summary of the restriction test results

[Table 1: only the fragments 1813 and 7 survive]

From Table 1 we can see that as the distance restriction is loosened, the
size of the resulting set of frequent episodes grows quite rapidly. However, it
stays well below the corresponding count for the situation where the restriction
is not applied at all.

The results shown here actually support the common rule of thumb that only
those alarms or events that were emitted by the direct neighbors in the control
hierarchy are meaningful. When the distance constraint is changed from 1 to
2, the number of frequent candidate sets increases rapidly. Candidates that
were generated with distances larger than 1 are more vulnerable to accidental
occurrences. The larger numbers of frequent episodes also made the computation
times grow rapidly, and therefore the larger frequent episodes were omitted.

We also used a third data set, C, that is an application log of a large software
system. It gave us results similar to those for sets A and B.

3.4 Quality Improvements 

The quality of frequent sets was evaluated by studying the largest frequent 
episodes found. From each set, it was checked whether it contained only related 
event types. Evaluation was based on an assumption that a frequent episode is 
of good quality if it contains only such event types, which can be caused by the 
related functions under some circumstances. 

The results with all of the test sets were encouraging. There were far fewer
clearly unrelated combinations of events in all the frequent episode collections.
Especially with the data set C, when the distance restriction was not applied,
all the larger covering sets were combinations of unrelated events. On the other
hand, when the restriction was used there were fewer frequent sets and only
meaningful combinations.

With the test sets A and B the situation was somewhat similar. However,
there were also frequent episodes that consisted of unrelated event types, but
their number was quite small.



4 Generalization of the Approach 

In a complex environment like a telecommunication network the role of domain
knowledge becomes important. In particular, information about different structures
in the network is needed in order to understand the data that the network
provides. For example, Figure 6 shows a small imaginary network that is constructed
in a small town. The town is built around two large water areas.




Fig. 6. A network in an urban area 

Cells in the network are grouped under three controllers so that each group 
would be as homogenous as possible and that there would be only a few han- 
dovers from one group to another. In the network there are three alarming cells, 
marked with dark color. All alarms that they emit are collected to a centralized 
monitoring unit, where they are analysed by a human expert. 

When an expert looks at a map with the alarming units, there are several possible
relationships that he must consider before he can decide
what the reason for the alarms is. The first thing to consider is the geographical
distance of the alarming cells: are they close to each other? If they are, as is the
case with the alarming cell couple in between the two gulfs, would it be possible
that they actually share the reason for their behavior? On the other hand, as is
the case between the couple and the third alarming cell below the gulf, would it
be possible that the cells are interfering with each other? This might be possible
since an open water area might carry signals quite far.

In addition to the geographical relationships, there might be other functional
connections between cells. In Figure 6 there is, for example, a transmission network
depicted with black dotted lines. The structure of a transmission network is
not necessarily reflected at all in the organization of the radio network. They
might even be operated by different operators. However, if a transmission line is
broken, it immediately affects not only both of its end points but also all the
traffic that has been routed through it. Therefore, a single wire cut might make
half of the network alarm vigorously.

4.1 Distance Types 

In the general case a data set S can be seen as a collection of events $e_j$
such that each event has attached properties, which include the event type E
and its occurrence time t. Traditionally, event time and type have been seen
as more important than other properties such as the cancellation time and the
severity of the event. However, this is not necessarily always the case. Any
one of the properties can be used as a target for pattern mining. Depending on
the application and the data set, some other properties might be more
informative.

In earlier analyses [5,9,10] time has been used as the only property that
separates possible event patterns from each other. In this article domain
structures have also been introduced for the same purpose. In general, there
might be several different types of constraining distance measures that can be
used. The most important ones are property distances, like distance in time;
domain distances, like the distances introduced in this article; and
characteristics distances, like the frequency of the type in the data set. A
fourth type of constraint consists of those that are deduced from previous
mining results or set by the user in order to reduce the number of results and
to focus on the most interesting phenomena.

Property distances are distance measures that are defined using event
properties. Such a property can be, for example, the beginning or cancellation
time of an event. These distances can be computed from the data set, or a part
of it, alone: the distance between the beginning or cancellation times of two
events can be computed without knowing anything else about the domain.

Domain distances are defined by using additional domain knowledge, for
example, a control structure of a telecommunication network. In order to be
able to compute distances over these structures, additional information about
the domain in the form of a model M is needed, as was discussed in Chapter 2.2.
There a distance $d_D$ of two events was computed by counting control hierarchy
steps between them: $d_D(source(e_i), source(e_j))$.

Characteristics distance is defined by the local statistical characteristics
that can be computed from the data set S. For example, the frequencies of event
types or the average activity times of their instances can differ from each
other so much that it is obvious that they cannot relate to each other. Another
characteristic that can separate event types is their distribution over time.
One event type can occur all over the data set from its beginning to its end,
while another event type might be as numerous in the data but concentrated in
a short peak.



4.2 Generalized Support 

Values of an event property $prop_i$ form a value set $V_{prop_i}$. Each value
$v_j$ in the set can be understood as a node. In many cases, it is possible to
define a graph between these nodes. For example, the values of the event
property beginning time form a directed graph in which a total order between
the nodes can be defined. A control hierarchy of a network, on the other hand,
forms a partially ordered directed graph, while the geographical locations of
network elements can be understood as an undirected graph.

The notion of support can be generalized so that different types of value
graphs can be used. The original support was defined as the number of time
windows in which an instance of a pattern occurs. Thus, as is shown in
Figure 7, a time window was moved over the time value graph. In each position
of the window all the events whose beginning times were inside the window were
interpreted to occur in the window. Correspondingly, it is possible to use
other value graphs for this purpose. For example, in a control topology a
window can be defined to contain all network elements that are not more than a
maximum number of steps away from the starting node. The window can then be
moved stepwise from the root node towards the leaves so that each branch is
visited once. In Figure 8, windows of size one step in a control topology are
shown. When defined this way, the control topology support of a pattern tells
in how many different network element branches a certain event type
combination occurs. This information is very useful, for example, in
evaluating the reason for a malfunction.

Fig. 7. A time window that has traversed over a graph of time points. At each
time point there has been one event
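
The two kinds of window can be contrasted in a small Python sketch. It is an
illustration only, not the implementation used in the experiments: the event
format (time, event type, source element), the function names, and the choice
of one topology window per node (the node plus its descendants within
`max_steps`) are assumptions made here for the example.

```python
from collections import defaultdict

# Hypothetical event format: (time, event_type, source_element), integer times.

def time_window_support(events, pattern, width):
    """Count positions of a sliding window [start, start + width) in which
    every event type of `pattern` occurs."""
    if not events:
        return 0
    times = [t for t, _, _ in events]
    count = 0
    for start in range(min(times) - width + 1, max(times) + 1):
        types = {etype for t, etype, _ in events if start <= t < start + width}
        if pattern <= types:
            count += 1
    return count


def topology_support(events, pattern, children, root, max_steps):
    """Count topology windows containing every event type of `pattern`.
    Assumption: one window per node, covering the node and its descendants
    that are at most `max_steps` control-hierarchy steps below it."""
    types_at = defaultdict(set)
    for _, etype, source in events:
        types_at[source].add(etype)

    def window(node, steps):
        seen = set(types_at[node])
        if steps > 0:
            for child in children.get(node, []):
                seen |= window(child, steps - 1)
        return seen

    count = 0
    stack = [root]
    while stack:  # visit every node of the control tree once
        node = stack.pop()
        if pattern <= window(node, max_steps):
            count += 1
        stack.extend(children.get(node, []))
    return count


# Toy control tree and events (invented for illustration).
children = {"root": ["c1", "c2"], "c1": ["b1", "b2"], "c2": ["b3"]}
events = [(1, "A", "b1"), (2, "B", "c1"), (3, "A", "b3"), (9, "B", "b2")]
pattern = {"A", "B"}

print(time_window_support(events, pattern, width=2))           # 2
print(topology_support(events, pattern, children, "root", 1))  # 1
```

In the toy data the pair {A, B} is seen in two time windows but in only one
branch of the control tree, which is the kind of distinction the topology
support is meant to expose.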

A special case of the generalized support arises with property value sets in
which it is impossible to define any kind of graph between the values. Such a
value set actually gives a natural partitioning of the data set. In such a
value set only those events whose values of the given property are equal can
support a pattern.

Fig. 8. A topology window that has traversed over a topology graph. Sets of
events have been attached to the elements that have sent them

It is possible to use value graphs of different properties to prune out
impossible or uninteresting candidates. For example, an alarm combination that
occurs only in one network element and its direct children might not be as
interesting as a combination that occurs in several network element branches.
So there might be a threshold $S_{prop_i}$ for each property $prop_i$. This has
an important consequence: to be covering, a pattern p must have
$support_{prop_i}(p) > S_{prop_i}$ for all i = 0..n, where n is the number of
properties used for pruning. Because of this, in order to get an exact answer
to a search for valid patterns, we must first check each possible occurrence of
a pattern and, only if it is valid with respect to all the constraints, update
all the support counters of the pattern.

Fortunately, it is also possible to compute an approximation of the set of
valid patterns. This can be done by using each support threshold to compute a
corresponding set of covering patterns, where only that support has been
checked. The resulting sets of covering patterns are called margin sets. In
order to be covering in all respects, a pattern $p_i$ must satisfy
$support_{prop_j}(p_i) > S_{prop_j}$ for each j = 0, ..., n. In other words,
the pattern has to have all of its property supports greater than the
corresponding thresholds, i.e., it has to belong to all of these margin sets.
It is in many cases much faster to compute all these margin sets separately
and take the intersection of the results, a so-called approximation set, than
to check each occurrence of all the candidate patterns in all respects.

This makes it possible to optimize the pattern computation by selecting 
the order in which the supports are computed so that those thresholds that 
are most powerful in pruning candidates or are easiest to compute are computed 
first. 

The problem here is, of course, that the approximation set of covering
patterns might contain patterns whose property supports are greater than the
corresponding thresholds, but whose occurrences are not actually valid.
Therefore, if an exact answer is needed, the occurrences of the resulting
patterns must be validated against the data set. Fortunately, the number of
occurrences to be validated is much smaller than if the validation had been
done in all respects at once.
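
The margin-set scheme described above can be sketched in a few lines of
Python. The function names, the toy support tables and the `validate`
callback are hypothetical and introduced only for this illustration. Filtering
the candidates by one threshold after another yields the same result as
intersecting the separately computed margin sets, and it reflects the remark
that the cheapest or most selective support should be evaluated first.

```python
def margin_set(candidates, support_fn, threshold):
    """Patterns whose support for one property exceeds its threshold."""
    return {p for p in candidates if support_fn(p) > threshold}


def approximation_set(candidates, supports_and_thresholds):
    """Intersection of all margin sets, computed as successive filters so
    that later supports are evaluated only for surviving candidates."""
    remaining = set(candidates)
    for support_fn, threshold in supports_and_thresholds:
        remaining = margin_set(remaining, support_fn, threshold)
    return remaining


def exact_set(approximation, validate):
    """Keep only the patterns whose occurrences pass the full validation.
    `validate` stands for the occurrence-level check against the data."""
    return {p for p in approximation if validate(p)}


# Toy support tables for three candidate pairs (invented numbers).
time_support = {("A", "B"): 5, ("A", "C"): 1, ("B", "C"): 4}
topo_support = {("A", "B"): 2, ("A", "C"): 3, ("B", "C"): 0}

approx = approximation_set(
    time_support,
    [(time_support.get, 2), (topo_support.get, 1)],
)
print(approx)  # {('A', 'B')}
```

Only the patterns left in `approx` would then be passed to `exact_set`, which
is where the saving in validation work comes from.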

4.3 Tests with Generalized Support 

With our test set B (see Chapter 3.2) we compared the pruning power of time
window support and topological support. The topological support was based on
a window that was sliding from the root node to the element branches. The
results of the experiment are given in Figure 9. It shows how the number of
covering pairs changes as the time support threshold is increased. The constant
value in the figure is the number of covering pairs that were computed with the
topological support threshold set to 1. The curve Intersection gives the number
of covering pairs that were left in the approximation set. The approximation
set was formed by taking the intersection of the pairs found with pure time
support and the pairs found with topology support. The curve Integrated gives
the number of covering sets that were covering with respect to both support
thresholds and all of whose occurrences were checked against the distance
constraints. The approximation set alone already improves the accuracy of the
answer by cutting the size of the resulting set to half compared to the time
support margin set. The topology support margin set also gives better results
than the time support margin set at a low time support threshold.

When the topology support threshold is further increased, the size of the
approximation set decreases. This can be seen, for example, in Figure 10. In
this domain it makes quite a lot of sense to do this, especially when the
information need is to find alarm combinations that occur often in the network
and contain the same set of alarms every time. If we use only the time support
to prune candidates, it lumps together all occurrences of the patterns that
come from different sources but at the same time point. We get much more
focused results when we use a relatively low time support together with a
relatively low topological support.

Fig. 9. Comparison of sizes of marginal sets, approximation set and exact results

Fig. 10. Improved pruning results with a topology support

In the windowing approach, different graph types emphasize events attached to
different parts of the graph in different ways. For example, in a
tree-structured graph like the one used in our experiments, where the depth of
the graph is relatively low, the nodes in the middle get more weight than the
leaves or the root. On the other hand, if the leaves contain a lot of the same
types of events, the situation is balanced. In any case, the root and its
direct descendants might be discriminated against. This is because the root
might be covered by only a few windows while there are many windows covering
the middle parts of the branches or the similar leaves. However, this is not a
problem in our domain, where the events sent by low-level network elements
usually form the majority of events in the data set. They are the first ones
to be correlated or combined and removed from the data.



5 Conclusions 

Domain knowledge, and especially different distance measures between domain
objects, can effectively be used in pruning impossible frequent episodes and
other types of patterns. They can be used in two ways. First, they might
provide efficient natural partitionings of the data. These partitions can then
be used to minimize the algorithm execution times. Second, and what can be
considered even more important, with them one can prune unnatural and
impossible combinations of events from the result set and thus ease the burden
of a human expert who is analysing the telecommunications network event logs.

In the near future, we will formalize the notion of generalized support. We
will also continue experiments on different data sets and domains, and we will
study more closely the time complexity as well as the usability of the
approximation set.

Abstract. In this paper, we propose to investigate the notion of integrity
constraints in inductive databases. We advocate that integrity constraints can
be used in this context as an abstract concept to encompass common data mining
tasks such as the detection of corrupted data or of patterns that contradict
the expert beliefs. To illustrate this possibility we propose a form of
constraints called association map constraints to specify authorized confidence
variations among the association rules. These constraints are easy to read and
thus can be used to write clear specifications. We also present experiments
showing that their satisfaction can be tested in practice.



1 Introduction 

Integrity constraints are a central notion in databases, used primarily to
ensure data consistency. They have proven to be a fruitful and useful concept,
with important additional benefits in guiding very different aspects such as
design, implementation and also query optimization (see [21,1] for an
overview).

Basically, integrity constraints are an abstract specification of the possible 
contents of the database with respect to our current knowledge of the data do- 
main. They have been deeply investigated in the context of relational databases 
as well as object-oriented databases, according to various objectives (e.g., spec- 
ifications, efficient checking). 

Recently, the concept of inductive database (IDB) has emerged [13,15,8],
promoting the vision that a database dedicated to data mining contains not
only data (e.g., customer transactions) but also all patterns that hold in the
data (e.g., association rules [2]). Ideally, the user of an IDB can query data
and patterns within a single language and can also express operations
involving both data and patterns. The collection of patterns may be several
orders of magnitude larger than the set of data itself, and thus cannot in
general be materialized in the IDB. However, from the user's point of view,
each pattern that holds should be considered as available for querying.

We advocate in this paper that integrity constraints are also a very promising
concept in IDB. As in the classical database frameworks, they can be used for
specifying data consistency and for rejecting inconsistent updates. However, in
the context of IDB, integrity constraints raise new interesting and challenging
issues if we consider that they can also be applied to the patterns that hold
in the data. From this viewpoint, they can be used to specify what knowledge
the designer (or an expert) considers reasonable to find in the data. Then, the
violation of a pattern integrity constraint may be seen as evidence of various
phenomena. For example, if we have an IDB containing alarm logs with a daily
insertion of a batch of new logs, and after such an update one of the pattern
integrity constraints is no longer satisfied, this may indicate that this set
of log records has not been properly cleaned. In this case the IDB engine can
abort and undo the insertion, and let the IDB administrator (or a user) check
these new logs. Another useful possibility is to consider that the
designer/expert specifies intensionally, with pattern integrity constraints,
the set of patterns that in her/his opinion could be found in the data. This
provides a way to delimit the acceptable laws that could hold with respect to
the knowledge that the designer/expert has about the domain. In this case an
integrity violation can be regarded as the occurrence of an unexpected
phenomenon, and the patterns violating that constraint can be considered
subjectively interesting pieces of information for the designer/expert.
Obviously, in the context of a multi-user IDB, such integrity constraints on
patterns can be customized by each user, so that she/he can add more specific
constraints than the ones set by the designer, to reflect her/his own beliefs
and background knowledge.

The detection of corrupted data and the identification of new interesting
knowledge among the extracted patterns are common tasks in data mining (see,
for example, the classification of actions proposed by [19] when beliefs are
contradicted). The idea that we want to point out in this paper is that large
parts of these processes can be incorporated nicely in the IDB framework by
means of integrity constraint specification and checking.

Of course, most forms of integrity constraints proposed previously in the
database domain can be reused to specify the contents of an IDB in terms of
tuples or objects (e.g., functional dependency, class inclusion hierarchy).
These constraints can be used directly to specify the data that are admissible
in an IDB, but also to specify the admissible patterns themselves, when these
patterns are encoded as tuples or objects. So, at first sight, one could
imagine choosing one of the very expressive languages already proposed in the
literature (e.g., using a data manipulation language itself [21]) and using it
to specify a large class of constraints over the patterns. The drawback of this
approach is that it does not take into account the tradeoff between
expressivity and computational complexity of constraint checking in the context
of IDB.

For example, using a Datalog-like language with a polynomial evaluation
complexity (w.r.t. the number of tuples in the database) to express constraints
may be reasonable in a relational database, but will in general not be
applicable in practice for IDB. The reason is that the number of patterns
stored in an IDB (in a materialized way or not) is, for common families of
patterns, inherently exponential with respect to the pattern domain parameters.
For instance, if we consider the patterns called frequent itemsets, they are
defined w.r.t. a set of binary attributes A, and a frequent itemset may be any
subset of A, leading in the worst case to a collection of $2^{|A|}$ patterns.
Even if this number can remain reasonable in some practical cases (e.g., using
a high frequency threshold on a sparse data set), when we set more difficult
conditions (e.g., a lower frequency), all practitioners have had to deal with
the problem of the exponential growth of the number of frequent itemsets
extracted. The same problem can be illustrated on other commonly used patterns
(e.g., association rules [2], frequent Datalog patterns [11]). So, we cannot
expect to be able to apply a general integrity checking process (even one
ensuring a polynomial evaluation complexity) on this set of patterns of
exponential size.

The situation can be even worse since in most cases, in an IDB these patterns 
are not fully materialized, and thus some extra (in general non-polynomial) 
computation is needed to enumerate them. 

In the context of IDB we propose to investigate the notion of integrity con- 
straint for patterns, by taking advantage of the following observation. In IDB 
each pattern is an expression of a specific pattern domain with its own semantics 
and thus could come with its specific family of integrity constraints, offering an 
acceptable tradeoff between expressivity and evaluation cost. 

In the rest of the paper we focus on a very common pattern called associ- 
ation rule [2] and we propose a dedicated form of integrity constraints called 
association map constraints. 

Association rules were proposed to represent dependencies between the
occurrences of items in customer transactions. Originally, the form of these
rules was $A_1, A_2, A_3, \ldots \Rightarrow B$¹, where $A_1, A_2, A_3, \ldots$
and $B$ denote items. The left hand side is called the antecedent, and the
right hand side the consequent. A confidence measure is defined for these
rules. The value of the confidence can be considered as the conditional
probability of having the consequent in a transaction when we have all items of
the antecedent. Another quality measure, called relative support, is generally
associated with the rules. A 10% relative support for a rule means that 10% of
the observed transactions support the rule, i.e., the items (antecedent and
consequent) could be observed together in 10% of the transactions. It should be
noticed that mining association rules is not restricted to basket data
analysis, and has been applied to many kinds of data sets after an appropriate
encoding with Boolean variables (e.g., [20]). Association rules have received a
lot of attention and several algorithms (e.g., [18,3,12]) have been designed to
extract them for given confidence and support thresholds.
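
As a concrete illustration of the two measures, the following Python sketch
computes the relative support and the confidence of a rule over a handful of
transactions. The transactions and the function names are invented for the
example; this is a direct reading of the definitions above, not one of the
cited extraction algorithms.

```python
def support_count(transactions, items):
    """Number of transactions containing every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)


def relative_support(transactions, antecedent, consequent):
    """Fraction of transactions containing antecedent and consequent together."""
    both = set(antecedent) | {consequent}
    return support_count(transactions, both) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated conditional probability of the consequent given the
    antecedent; None when no transaction contains the antecedent."""
    covered = support_count(transactions, antecedent)
    if covered == 0:
        return None
    both = set(antecedent) | {consequent}
    return support_count(transactions, both) / covered


# Invented toy transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"bread", "milk"},
]

print(relative_support(transactions, {"bread"}, "butter"))  # 0.4
print(confidence(transactions, {"bread"}, "butter"))        # 0.5
```

Here bread and butter occur together in 2 of 5 transactions (relative support
40%), and butter occurs in 2 of the 4 transactions that contain bread
(confidence 50%).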

An association map is an abstract specification of the set of association rules
that could hold in the data according to our current knowledge of the data
domain. Association maps are a good candidate for a dedicated form of integrity
constraints since they are very concise and readable in the following sense.
Firstly, a small association map is sufficient to constrain a huge collection
of association rules. Secondly, an association map has a strong hierarchical
structure enabling quick intuitive browsing while its semantics remain very
simple. And finally, another interesting property of association maps for their
use as integrity constraints is that their satisfaction can be checked in a
reasonably efficient way in practice.

¹ Consequents made of several items are also considered in the literature. The
notion of association map constraint proposed in this paper can be adapted
easily to this other form.

The rest of this paper is organized as follows. In Section 2 we informally 
present the notion of association map constraint. More formal definitions and an 
algorithm to compute association maps are given in Section 3. In Section 4 we 
describe experiments showing that these constraints can be checked efficiently 
in practice even in difficult cases. We review related work and conclude with a 
summary in Section 5. 

2 Informal Presentation 



In this section we introduce in an informal way the notion of integrity
constraint based on association maps for IDB.

The key idea behind an association map is to represent what the confidence
variation should be if a particular item is added to or removed from the
antecedent of a rule. Let us take a toy example where each transaction in the
data set of the IDB describes one person involved in a car crash (her/his
characteristics, the context, the damages).

We suppose that the designer has some knowledge of the car crash domain and
wants to use it as integrity constraints over the association rules that she/he
thinks could reasonably hold in the IDB. We make the hypothesis that there is a
wide variety of such knowledge that can be expressed as the variation of rule
confidence w.r.t. the presence/absence of a particular attribute in the
left-hand side of the rule. For example, consider that for car crashes the
expert thinks that the use of an airbag reduces the probability of severe
injury, except for persons who wear glasses. This opinion can be seen as a
constraint (denoted $IC_1$ below) on the variation of the confidence of rules
concluding on severe injury, w.r.t. a variation criterion which is the
presence/absence of an airbag.

Consider the following association rules that hold (among others) in the
current instance of our IDB. For each rule we indicate the corresponding
confidence, and one can easily see that this set of rules does not contradict
the integrity constraint $IC_1$ set by the designer.

    ∅                              =>  severe injury   20%
    airbag                         =>  severe injury   10%
    driver                         =>  severe injury   18%
    driver, airbag                 =>  severe injury   10%
    wear glasses                   =>  severe injury   15%
    wear glasses, airbag           =>  severe injury   20%
    wear glasses, driver           =>  severe injury   20%
    wear glasses, driver, airbag   =>  severe injury   25%

An association map is simply an explicit synthetic representation of these
confidence variations in terms of the effects of the presence/absence of the
variation criterion in rule antecedents. A map is defined for a given
consequent (e.g., severe injury), for a particular item used as variation
criterion (e.g., airbag) and for a given support threshold, but without any
confidence threshold. It contains a set of regions, where each region is
characterized by a homogeneous effect on rule confidence when we add the
variation criterion to the rule antecedent. The effect is called positive
(resp. negative) if the addition of the variation criterion results in an
increase (resp. a decrease) of the rule confidence. A region is delimited by a
lower bound (w.r.t. set inclusion), which is a rule antecedent (a set of items)
called the base. Upward, a region is delimited by a border composed of the rule
antecedents that are the minimal supersets of the base where the effect
changes, from positive to negative or from negative to positive (neutral
effects are not considered as real changes). Finally, all elements in the
border of a region can themselves be the bases of new regions. Additionally, it
should be noticed that rules having a support lower than the given threshold
are not represented by the map.

The constraint $IC_1$ can be expressed as the association map depicted in
Figure 1. If we consider all possible rule antecedents (excluding the items
severe injury and airbag, which represent the rule consequent and the variation
criterion), these antecedents can be organized in a lattice (w.r.t. set
inclusion). This lattice is depicted using dashed lines in Figure 1. The bases
and the border elements are simply particular elements of this lattice that
delimit the regions of homogeneous effects. For constraint $IC_1$ we have two
such regions. The first has the empty set as base and {wear-glasses} as border,
and in it the effect is negative. The second has base {wear-glasses} and border
{wear-glasses, driver}, and in it the effect is positive. In the graphical
representation, the space of all supersets of a base is sketched by a conic
shape. The map presented in Figure 1 can be read as follows: if we add airbag
to the antecedent of ∅ => severe injury then the confidence decreases, and this
holds for all rules except if the antecedent contains wear-glasses, in which
case the confidence increases.

Suppose that a new set of transactions representing data related to pregnant
women is inserted in the IDB, and that now we have the additional rules²:

    pregnant                                                               =>  severe injury   30%
    pregnant, airbag                                                       =>  severe injury   25%
    driver, pregnant                                                       =>  severe injury   30%
    driver, pregnant, airbag                                               =>  severe injury   40%
    driver, pregnant, less-than-i-month-pregnancy                          =>  severe injury   19%
    driver, pregnant, less-than-i-month-pregnancy, airbag                  =>  severe injury   12%
    wear glasses, pregnant                                                 =>  severe injury   35%
    wear glasses, pregnant, airbag                                         =>  severe injury   40%
    wear glasses, driver, pregnant                                         =>  severe injury   35%
    wear glasses, driver, pregnant, airbag                                 =>  severe injury   42%
    wear glasses, driver, pregnant, less-than-i-month-pregnancy            =>  severe injury   23%
    wear glasses, driver, pregnant, airbag, less-than-i-month-pregnancy    =>  severe injury   28%

² To simplify the example we suppose that the previous rules still hold with
the same confidence.

Fig. 1. Association map $IC_1$

We recall that in the context of an IDB these rules are not necessarily
extracted and materialized after the insertion of the new data, but from the
user's point of view they can be used/retrieved at any time.

If we take a close look at these rules, we notice that some of these patterns
no longer satisfy $IC_1$. This can be seen more clearly by drawing the
association map corresponding to the whole new set of association rules (for
the consequent severe injury, the variation criterion airbag, and the same
support threshold). This map is depicted in Figure 2, where, for readability
reasons, the underlying lattice has not been represented. Testing whether
$IC_1$ is satisfied can then be performed by comparing this map to the map
corresponding to $IC_1$. This is done in Figure 2, where the area of patterns
that violate $IC_1$ is highlighted in grey.

In practice, association maps can be used as integrity constraints in an
inductive database as follows. First, the user or the database designer gives a
collection of association maps to specify the authorized confidence variations
in terms of known effects for specific variation criteria and rule consequents.
After an update (or a sequence of updates) of the data, the maps describing
these effects are computed (from the data) by the inductive database system.
Then, they are compared automatically to the ones that have been specified. If
a difference is found, the system rejects the update(s) and presents this
difference to the user (possibly using a graphical representation with
highlighted areas such as the one in Figure 2). The user can then assess
whether the difference comes from a corruption of the data or is due to effects
that are not correctly specified by the integrity constraints. In the latter
case, this can lead to the modification of the integrity constraints and be a
clue to finding an unknown phenomenon.

Fig. 2. Patterns that do not satisfy $IC_1$
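
The checking step can be illustrated with a small Python sketch on a subset of
the rules of the running example. It is deliberately simplified and is not the
algorithm of Section 3: the constraint is encoded as a list of (base, effect)
regions without borders, the expected effect of an antecedent is taken from the
matching region with the largest base, and all names are hypothetical.

```python
def fs(*items):
    return frozenset(items)


# IC_1 as (base, expected effect of adding airbag); simplified encoding.
IC1 = [(fs(), "negative"), (fs("wear glasses"), "positive")]

# Confidence (%) of "antecedent => severe injury", from the example rules.
CONF = {
    fs("pregnant"): 30,
    fs("pregnant", "airbag"): 25,
    fs("driver", "pregnant"): 30,
    fs("driver", "pregnant", "airbag"): 40,
    fs("wear glasses", "pregnant"): 35,
    fs("wear glasses", "pregnant", "airbag"): 40,
    fs("wear glasses", "driver", "pregnant"): 35,
    fs("wear glasses", "driver", "pregnant", "airbag"): 42,
}


def expected_effect(regions, antecedent):
    """Effect of the most specific region whose base is in the antecedent."""
    matching = [(base, eff) for base, eff in regions if base <= antecedent]
    return max(matching, key=lambda region: len(region[0]))[1]


def observed_effect(conf, antecedent, criterion):
    diff = conf[antecedent | {criterion}] - conf[antecedent]
    return "positive" if diff > 0 else "negative" if diff < 0 else "neutral"


def violations(regions, conf, criterion):
    """Antecedents whose observed effect contradicts the specified map."""
    bad = []
    for antecedent in conf:
        if criterion in antecedent or antecedent | {criterion} not in conf:
            continue
        seen = observed_effect(conf, antecedent, criterion)
        if seen != "neutral" and seen != expected_effect(regions, antecedent):
            bad.append(antecedent)
    return bad


bad = violations(IC1, CONF, "airbag")
print(bad)  # [frozenset({'driver', 'pregnant'})] (item order may vary)
if bad:
    print("update rejected: confidence variations contradict IC_1")
```

On these rules the only contradiction is the antecedent {driver, pregnant}, for
which adding airbag raises the confidence from 30% to 40% although no
wear-glasses item is present.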



2.1 Refinement of Maps 

The notion of association map presented informally above can lead, on real data
sets, to many regions that are not appropriate. We introduce in this section
two thresholds used to avoid such situations.

Discarding Extra Regions Using Strong Dependencies. Many exact or 
nearly exact association rules hold in real data sets and this phenomenon has 
been used recently to condense huge collections of itemsets [16,7]. 

Let us consider that we generate a map for consequent C and variation criterion
H, and that we have between items A and B the association $A \Rightarrow B$
with a confidence of 100% (such a rule can be due, for example, to a functional
dependency holding in the data). Then $A \Rightarrow C$ and
$A, B \Rightarrow C$ have the same confidence, and this is also true for rules
$A, H \Rightarrow C$ and $A, B, H \Rightarrow C$. Thus, the effect of H on
confidence is the same for antecedents {A} and {A, B}. Moreover, the same holds
for any pair of antecedents $X$ and $X \cup \{B\}$, where X is a superset of
{A}. This means that the portion of the map generated for supersets of $X$ is
redundant with the part constructed for supersets of $X \cup \{B\}$.

Now, suppose that {B} is the base of a region, and that the effect of H for
{A, B} is different from the effect of H on {B}. In this case, {A, B} turns out
to be in the border of the region based on {B} and results in the generation of
an extra region based on {A, B}. Since we know that the part of the map
corresponding to supersets of {A, B} is redundant with the one for supersets of
{A}, we can avoid the construction of the extra region based on {A, B}. For any
region based on a superset $Y$ of {B} we can make the same simplification when
the region based on $Y \cup \{A\}$ has a different effect.

So, we can discard any base $X$ if there exist $Y \subset X$ and
$A \in X \setminus Y$ such that $Y \Rightarrow \{A\}$ holds with a 100%
confidence (i.e., if there is an exact rule between items in $X$).

It should be noticed that exact rules are not likely to be found in noisy data
sets or in the presence of missing values. In these cases, they appear in the
form of rules having a small number of exceptions. So, in the definitions given
in Section 3.1 we discard a base $X$ if there is a nearly exact rule between
items in $X$. These nearly exact rules, called $\delta$-strong rules (rules
with at most $\delta$ exceptions), have been used previously in a different
context [7] to condense collections of itemsets and mine frequent patterns more
efficiently.

For association map extraction, $\delta$ will be a threshold called freeness.
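
The discarding test can be sketched as follows in Python, under assumptions
made only for this illustration: rows are plain sets of items, and the names
`support` and `has_delta_strong_rule` are hypothetical. The sketch tests only
rules of the form $X \setminus \{A\} \Rightarrow \{A\}$; this is enough because
enlarging the antecedent of a rule can never increase its number of exceptions,
so if some smaller $Y$ gives a nearly exact rule, then so does
$X \setminus \{A\}$.

```python
def support(rows, itemset):
    """Number of rows that contain every item of `itemset`."""
    return sum(1 for row in rows if itemset <= row)


def has_delta_strong_rule(rows, base, delta=0):
    """True if some item A of `base` is implied by the remaining items of
    `base` with at most `delta` exceptions (delta=0 means an exact rule)."""
    base = frozenset(base)
    for item in base:
        rest = base - {item}
        # Rows matching the antecedent but lacking the consequent item.
        exceptions = support(rows, rest) - support(rows, base)
        if exceptions <= delta:
            return True
    return False


# Invented toy rows: every row containing A also contains B.
rows = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"C"}]

print(has_delta_strong_rule(rows, {"A", "B"}))           # True
print(has_delta_strong_rule(rows, {"B", "C"}))           # False
print(has_delta_strong_rule(rows, {"B", "C"}, delta=1))  # True
```

A base for which the function returns True would be skipped when regions are
generated.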



Avoiding Regions Created as Artefacts 

A confidence is the ratio s1/s2 of two integer support values. So, it cannot change
in a continuous way, but only by discrete steps.

Let σ be the absolute support threshold used to generate the rules. The greatest
discrete step variation due to a single row is encountered when confidence
jumps from (σ + 1)/(σ + 1) to σ/(σ + 1). Thus, a confidence variation smaller than
1 − σ/(σ + 1) cannot be considered as really significant.
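
The size of this smallest step can be written out explicitly. The following short derivation only rearranges the expression given above; the numerical value of σ used afterwards is a hypothetical illustration and is not taken from the experiments.

```latex
\[
\frac{\sigma + 1}{\sigma + 1} - \frac{\sigma}{\sigma + 1}
  \;=\; 1 - \frac{\sigma}{\sigma + 1}
  \;=\; \frac{1}{\sigma + 1}
\]
```

For instance, with an assumed absolute threshold σ = 99, a single row can change a confidence by as much as 1/100 = 0.01, so a variation below 0.01 would not be regarded as a real effect.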

So, we use another threshold τ called tolerance to indicate what we consider as
a clear confidence variation. When we add the item used as variation criterion to
the antecedent of a rule, the variation of confidence must be strictly greater than
τ (resp. strictly lower than −τ) to be interpreted as a positive (resp. negative)
effect. Otherwise the effect is said to be neutral.

The bases of regions are restricted to be such that their effects must be either
positive or negative, except for the first base (the empty set) where the effect is
also allowed to be neutral. Thus, in a region, when we encounter a neutral effect
we consider that we are still in the same region, and it is only when we find a
different and significant positive or negative effect that we generate a border
element.



3 Computing Association Maps 

In this section, we give more formal definitions and present a way to compute 
association maps.

3.1 Definitions

Preliminary

Definition 1 (Binary Database). Let R be a set of symbols called items. An
itemset is a subset of R. A binary database r over R is a multiset of rows, where
a row is an itemset. We use the notation t ∈ r to denote that a particular row t
belongs to r.

In this section, we assume that the data set is a binary database r over a set 
of items R. 

Definition 2 (Itemset Support). We denote by M(r, X) = {t ∈ r | X ⊆ t} the
multiset of rows in r matched by the itemset X and by Sup(r, X) = |M(r, X)| the
support of X in r, i.e., the number of rows matched by X.

Definition 3 (Association Rule [2]). Let Y ⊆ R be an itemset. An association
rule over Y is an expression of the form X ⇒ C, where C ∈ Y and
X ⊆ Y \ {C}. The support of a rule in r is denoted Sup(r, X ⇒ C) and is
defined by Sup(r, X ⇒ C) = Sup(r, X ∪ {C}). Its confidence is
Conf(r, X ⇒ C) = Sup(r, X ∪ {C}) / Sup(r, X).
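
Definitions 2 and 3 translate almost literally into code. The following is a minimal Python sketch, assuming a database stored as a list of frozensets; the function names `sup` and `conf` and the toy rows are illustrative choices, not part of the original text.

```python
from typing import FrozenSet, List

Row = FrozenSet[str]


def sup(r: List[Row], x: FrozenSet[str]) -> int:
    """Support of itemset x: number of rows of r that contain every item of x."""
    return sum(1 for t in r if x <= t)


def conf(r: List[Row], x: FrozenSet[str], c: str) -> float:
    """Confidence of the rule x => c. Assumes sup(r, x) > 0."""
    return sup(r, x | {c}) / sup(r, x)


# Hypothetical toy database with four rows over the items A, B, C.
r = [frozenset("ABC"), frozenset("AC"), frozenset("AB"), frozenset("BC")]

print(sup(r, frozenset("A")))        # rows containing A
print(conf(r, frozenset("A"), "C"))  # share of those rows that also contain C
```

Because r is a multiset, a plain list (which keeps duplicate rows) is used rather than a set of rows.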

We consider σ ∈ (0, |r|] a support threshold. It should be noticed that it
corresponds to an absolute number of rows. However, to facilitate the reading of
some examples we also use a relative support threshold, which simply corresponds
to σ/|r|.

Definition 4 (Frequent Association Rules). A frequent association rule
over R w.r.t. σ and r is an association rule X ⇒ C over R, such that
Sup(r, X ⇒ C) ≥ σ. We denote by FreqRules(r, σ) the set of all frequent rules
over R w.r.t. σ and r.

We also recall the definitions of δ-strong rules and δ-free sets, needed to
define association maps. These two notions have been introduced in a different
context³ in [7,6].

Definition 5 (δ-Strong Rule). A δ-strong rule⁴ in a binary database r is an
association rule X ⇒ C over R such that Sup(r, X) − Sup(r, X ∪ {C}) ≤ δ, i.e.,
the rule is violated in no more than δ rows.

In this definition, δ is supposed to have a small value, so a δ-strong rule is
intended to be a rule with very few exceptions.

Definition 6 (δ-Free Set). X ⊆ R is a δ-free set w.r.t. r if and only if there is
no δ-strong rule over X in r. The set of all δ-free sets w.r.t. r is denoted Free(r, δ).

³ Originally, δ-free sets were proposed as a condensed representation that can be
extracted very efficiently and that can be used to closely approximate the support
of all itemsets that are frequent w.r.t. a given support threshold.

⁴ Stemming from the notion of strong rule of [17].

Since δ is supposed to be rather small, informally, a δ-free set is a set of
items such that these items are not related by any very strong positive
dependency.
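
To make Definitions 5 and 6 concrete, the sketch below tests δ-freeness by brute force, enumerating every rule Y ⇒ A over the items of X. It is only meant as an illustration on small examples, under the assumption that the database is a list of frozensets; the names `sup` and `is_delta_free` and the toy rows are hypothetical, and this is not the mining technique of [7,6].

```python
from itertools import combinations
from typing import FrozenSet, List


def sup(r: List[FrozenSet[str]], x: FrozenSet[str]) -> int:
    """Number of rows of r matched by itemset x."""
    return sum(1 for t in r if x <= t)


def is_delta_free(r: List[FrozenSet[str]], x: FrozenSet[str], delta: int) -> bool:
    """True when no delta-strong rule Y => A exists with A in x and Y within x minus {A}."""
    for a in x:
        rest = sorted(x - {a})
        for k in range(len(rest) + 1):
            for y in combinations(rest, k):
                y_set = frozenset(y)
                # Number of rows violating Y => A (Definition 5).
                exceptions = sup(r, y_set) - sup(r, y_set | {a})
                if exceptions <= delta:
                    return False
    return True


# Hypothetical toy database: A never occurs without B, so A => B is exact.
r = [frozenset("AB"), frozenset("AB"), frozenset("AB"), frozenset("B"), frozenset("C")]

print(is_delta_free(r, frozenset("AB"), 0))  # {A, B} contains the exact rule A => B
print(is_delta_free(r, frozenset("BC"), 0))  # no exact rule among B and C
```

Raising δ makes the test stricter: in this toy database {B, C} is 0-free but not 1-free, because B is absent from only one row.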



Effect Regions and Association Maps. As presented in Section 2, an association
map is defined w.r.t. two items (the consequent and the variation criterion)
and three thresholds (support, tolerance and freeness). In this section, we denote
respectively by C ∈ R the consequent and by H ∈ R the variation criterion, and we
use σ ∈ (0, |r|], τ ∈ [0, 1] and an integer δ to represent respectively the support,
the tolerance and the freeness thresholds.

Definition 7 (Local Effect). Let X ⊆ R \ {C, H} be an itemset. The local
effect of H on rule X ⇒ C, denoted LocEffect(r, τ, X, C, H), is defined as:

\[
\mathrm{LocEffect}(r, \tau, X, C, H) =
\begin{cases}
1 & \text{if } \mathrm{Conf}(r, X \cup \{H\} \Rightarrow C) - \mathrm{Conf}(r, X \Rightarrow C) > \tau \\
-1 & \text{if } \mathrm{Conf}(r, X \cup \{H\} \Rightarrow C) - \mathrm{Conf}(r, X \Rightarrow C) < -\tau \\
0 & \text{otherwise}
\end{cases}
\]



According to its value, the effect is respectively called positive, negative or 
neutral. 
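
The local effect can be computed directly from four support counts. The following Python sketch assumes a database stored as a list of frozensets and non-zero supports for X and X ∪ {H}; the function names and the toy rows are illustrative assumptions.

```python
from typing import FrozenSet, List


def sup(r: List[FrozenSet[str]], x: FrozenSet[str]) -> int:
    """Number of rows of r matched by itemset x."""
    return sum(1 for t in r if x <= t)


def conf(r: List[FrozenSet[str]], x: FrozenSet[str], c: str) -> float:
    """Confidence of x => c. Assumes sup(r, x) > 0."""
    return sup(r, x | {c}) / sup(r, x)


def loc_effect(r: List[FrozenSet[str]], tau: float,
               x: FrozenSet[str], c: str, h: str) -> int:
    """Local effect of h on x => c: 1 (positive), -1 (negative) or 0 (neutral)."""
    variation = conf(r, x | {h}, c) - conf(r, x, c)
    if variation > tau:
        return 1
    if variation < -tau:
        return -1
    return 0


# Hypothetical toy database: H tends to come with C, G tends to come without it.
r = [frozenset("HC"), frozenset("HC"), frozenset("H"),
     frozenset("CG"), frozenset("G"), frozenset("G")]

print(loc_effect(r, 0.1, frozenset(), "C", "H"))  # confidence rises from 1/2 to 2/3
print(loc_effect(r, 0.1, frozenset(), "C", "G"))  # confidence falls from 1/2 to 1/3
```

With a larger tolerance such as τ = 0.2, both variations (about 0.167 in absolute value) fall inside the neutral band.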

We now define the antecedents that are significant for generating the maps.

Definition 8 (Significant Antecedent). SigAnte(r, σ, τ, δ, C, H) is the
collection of significant antecedents and is defined by SigAnte(r, σ, τ, δ, C, H) =
{X ⊆ R | (X ∪ {H} ⇒ C) ∈ FreqRules(r, σ) ∧ X ∈ Free(r, δ) ∧
LocEffect(r, τ, X, C, H) ∈ {−1, 1}}.

These antecedents are itemsets that form with H the antecedent of a frequent
rule. Moreover, they must be made of items that are not strongly dependent
(i.e., they are δ-free) and where the local effect is clearly positive or negative
(not neutral). Then, for an itemset X we define the effect border, which is
the collection of the minimal supersets of X that are significant antecedents
and where the local effect changes strongly (from positive to negative or from
negative to positive).

Definition 9 (Effect Border). Let X be an itemset such that X ⊆ R \ {C, H}.
The effect border for X is Border(r, σ, τ, δ, X, C, H) = {Y ⊆ R | X ⊂ Y ∧
Y ∈ SigAnte(r, σ, τ, δ, C, H) ∧ LocEffect(r, τ, X, C, H) ≠ LocEffect(r, τ, Y, C, H) ∧
(∀Z, X ⊂ Z ⊂ Y ⇒ LocEffect(r, τ, Z, C, H) ∈ {0, LocEffect(r, τ, X, C, H)})}.

Now we consider regions of homogeneous effect, i.e., sets of rule antecedents
having a common subset and a common local effect. We first define their lower
bounds (w.r.t. set inclusion), called effect bases, as follows. The empty set is an
effect base. A significant antecedent which is in the effect border of an effect base
of smaller size is also an effect base. This notion is expressed more formally by
the next definition.

Definition 10 (Effect Base). The collection of effect bases is defined
inductively as follows.

\[
\begin{aligned}
\mathrm{Base}_0 &= \{\emptyset\} \\
\mathrm{Base}_i &= \{X \subseteq R \mid |X| = i \wedge X \in \mathrm{SigAnte}(r, \sigma, \tau, \delta, C, H) \wedge (\exists Y \in \textstyle\bigcup_{j<i} \mathrm{Base}_j,\ X \in \mathrm{Border}(r, \sigma, \tau, \delta, Y, C, H))\} \\
\mathrm{Base}(r, \sigma, \tau, \delta, C, H) &= \bigcup_{i \geq 0} \mathrm{Base}_i
\end{aligned}
\]

Then an association map is simply the collection of all effect bases together 
with their borders. 

Definition 11 (Association Map). An association map for a binary database
r w.r.t. items C, H and thresholds σ, τ, δ is defined by AMap(r, σ, τ, δ, C, H) =
{(X, B) | X ∈ Base(r, σ, τ, δ, C, H) ∧ B = Border(r, σ, τ, δ, X, C, H)}.

Each tuple in an association map corresponds to the lower and upper bounds
(w.r.t. set inclusion) of a region where the local effect does not change
significantly.

Note that two different effect regions may overlap. This overlapping may
occur even when their respective effect bases have opposite local effects; in this
case the itemsets that belong to both regions have a neutral local effect.

3.2 Algorithm 

We present a generic algorithm called GenMap to produce the map for a
consequent C, a variation criterion H, and thresholds δ, σ and τ corresponding
respectively to freeness, support and tolerance thresholds.

The algorithm calls three functions: CandAnte, Signif and Effect.

The algorithm is presented taking as input a set S = FreqRules(r, σ) of
all frequent association rules along with their supports.

Algorithm 1 (GenMap)

Input: C, H items, n the size of the largest candidate antecedent, set S,
thresholds δ, σ and τ.

Used subprograms: CandAnte(S, i, C, H) establishes the set of itemsets of
size i, not containing C or H, that are candidates for being significant
antecedents. Signif(S, X, C, H, τ, δ, σ) finds out whether X is a significant
antecedent or not. Effect(S, X, C, H) is used to compute the
local effect of H for X.

Output: a set of tuples containing all effect bases, and their corresponding
effects and borders.

 1. let E_∅ := Effect(S, ∅, C, H), Map := {(∅, E_∅, ∅)};
 2. for all i ∈ {1, ..., n} do
 3.   for all X ∈ CandAnte(S, i, C, H) do
 4.     if Signif(S, X, C, H, τ, δ, σ) then
 5.       let E_X := Effect(S, X, C, H);
 6.       let MaxSubBases_X := {(Y, E_Y, B_Y) ∈ Map |
                Y ⊂ X ∧ E_Y ≠ E_X ∧ ∀W ∈ B_Y, W ⊄ X};
 7.       if MaxSubBases_X ≠ ∅ then
 8.         let Map := Map ∪ {(X, E_X, ∅)};
 9.         for all (Z, E_Z, B_Z) ∈ MaxSubBases_X do
10.           let Map := (Map \ {(Z, E_Z, B_Z)}) ∪
                {(Z, E_Z, B_Z ∪ {X})};
11.         od
12.       fi
13.     fi
14.   od
15. od
16. output Map

In line 1, GenMap considers the empty itemset, which is always an effect base 
according to Definition 10. In line 2, the algorithm enters a loop corresponding 
to increasing sizes of candidate antecedents. 

For each candidate antecedent X the algorithm checks if it is significant
(line 4), and if so, GenMap computes in line 6 the set MaxSubBases_X of all
bases of regions that contain X in their border.

If at least one such region exists (line 7), then X is also an effect base, and
the corresponding tuple is created in line 8. X is then stored in the borders of
all regions having their bases in MaxSubBases_X (lines 9-11).
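
The control flow of GenMap can be sketched in a few lines of Python. This is an illustrative transcription under stated assumptions, not the authors' prototype: the three subprograms are passed in as callables with S, C, H and the thresholds already bound, the map is held in a dictionary from each base to its effect and border, and the toy effect table at the end is hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List

Itemset = FrozenSet[str]


def gen_map(n: int,
            cand_ante: Callable[[int], Iterable[Itemset]],
            signif: Callable[[Itemset], bool],
            effect: Callable[[Itemset], int]) -> Dict[Itemset, List]:
    """Return {base: [effect, set of border elements]} following Algorithm 1."""
    empty: Itemset = frozenset()
    amap: Dict[Itemset, List] = {empty: [effect(empty), set()]}   # line 1
    for i in range(1, n + 1):                                      # line 2
        for x in cand_ante(i):                                     # line 3
            if not signif(x):                                      # line 4
                continue
            e_x = effect(x)                                        # line 5
            # Line 6: bases strictly below x with a different effect and
            # no border element already lying strictly inside x.
            max_sub_bases = [
                y for y, (e_y, b_y) in amap.items()
                if y < x and e_y != e_x and not any(w < x for w in b_y)
            ]
            if max_sub_bases:                                      # line 7
                amap[x] = [e_x, set()]                             # line 8
                for z in max_sub_bases:                            # lines 9-11
                    amap[z][1].add(x)
    return amap


# Hypothetical toy input: local effects given directly, every candidate significant.
effects = {frozenset(): 1, frozenset("A"): -1, frozenset("B"): 1, frozenset("AB"): 1}
by_size = {1: [frozenset("A"), frozenset("B")], 2: [frozenset("AB")]}
toy_map = gen_map(2, lambda i: by_size[i], lambda x: True, effects.__getitem__)

for base, (e, border) in toy_map.items():
    print(sorted(base), e, [sorted(b) for b in border])
```

In this toy run {A} becomes a base because its effect differs from that of the empty set, {B} does not because its effect is the same, and {A, B} becomes a base in the border of {A}.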

Theorem 1 (Correctness of GenMap). The algorithm GenMap outputs the
effect bases (along with the corresponding border elements and effects) of the
association map defined for a consequent C, a variation criterion H, and thresholds
δ, σ and τ.

Proof. The proof is made by induction on the size of the bases. Note that the
effect base ∅ is included in Map by the first line of the algorithm.

Hypothesis. Suppose that for every effect base X of size less than or equal to i
the algorithm GenMap correctly reported X as a base and as a border element in
Map.

Consider an effect base X of size i + 1. We are going to show that X is
correctly reported as a base and as a border element in Map.

Since X is an effect base, X is returned by CandAnte(S, i + 1, C, H) (line 3)
and not filtered out by Signif(S, X, C, H, τ, δ, σ) (line 4). Therefore, it will be
considered in lines 5-12. By Definition 10, there is at least one effect base Y ⊂ X
such that X is in the border of the region of base Y. Assuming that the induction
hypothesis holds, we find all such bases in line 6. Then, X is added as a base to
Map in line 8, and correctly reported as a border element in lines 9-11.

So, the algorithm correctly reports all effect bases and border elements in
Map. The soundness of every update of Map is immediate. □

3.3 Computing Association Maps from Association Rules 

GenMap can compute the association maps using as input S, the collection of all
frequent rules, and running the functions CandAnte(S, i, C, H), Signif(S, X, C,
H, τ, δ, σ) and Effect(S, X, C, H) defined in the following manner.

CandAnte(S, i, C, H) selects from S the rules having an antecedent of size i
and consequent C, but skips the rules containing H in their antecedents. Then,
it returns the collection of all antecedents of these rules.

Signif(S, X, C, H, τ, δ, σ) checks if X ∪ {H} ⇒ C is in S (i.e., if the rule is
frequent), and if it is the case it tests the local effect of H. To do so, it finds in S
the rule X ⇒ C, and compares the confidences of the two rules. If the absolute
value of their difference is less than or equal to τ, the function exits returning false
(the local effect is neutral). Otherwise the δ-freeness of X is tested by simply
checking that for every A ∈ X the difference between the support of X \ {A} and
that of X is strictly greater than δ. It should be noticed that the supports of
X \ {A} and X can be obtained using S as follows. Let us consider that we need the
support of a frequent itemset Z. Let B be any item such that B ∈ Z, then the
rule Z \ {B} ⇒ B is frequent and is in S. By Definition 3 the support of Z is
equal to the support of this rule.

If X is δ-free, Signif(S, X, C, H, τ, δ, σ) returns true, and false otherwise.

Effect(S, X, C, H) finds the confidences of the rules X ⇒ C and X ∪ {H} ⇒ C
in S, and then returns the local effect of H according to the difference between
the confidences of the two rules.
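
The support lookup and the freeness test described above can be sketched as follows. The sketch assumes that S is held as a dictionary from (antecedent, consequent) pairs to rule supports, and that the number of rows is passed separately so that the empty itemset can be handled; both choices, like the function names and the toy rule collection, are illustrative assumptions.

```python
from typing import Dict, FrozenSet, Tuple

Rule = Tuple[FrozenSet[str], str]  # (antecedent, consequent)


def itemset_support(s: Dict[Rule, int], z: FrozenSet[str], n_rows: int) -> int:
    """Support of a frequent itemset z, read from any rule (z minus {b}) => b in S."""
    if not z:
        return n_rows  # the empty itemset matches every row
    b = next(iter(z))
    return s[(z - {b}, b)]


def is_free_from_rules(s: Dict[Rule, int], x: FrozenSet[str],
                       delta: int, n_rows: int) -> bool:
    """Freeness test of this section: each rule (x minus {a}) => a must have
    strictly more than delta exceptions."""
    sup_x = itemset_support(s, x, n_rows)
    return all(itemset_support(s, x - {a}, n_rows) - sup_x > delta for a in x)


# Hypothetical rule collection for a 5-row database in which A => B is exact.
S = {
    (frozenset(), "A"): 3,
    (frozenset(), "B"): 4,
    (frozenset(), "C"): 1,
    (frozenset("A"), "B"): 3,
    (frozenset("B"), "A"): 3,
}

print(itemset_support(S, frozenset("AB"), 5))
print(is_free_from_rules(S, frozenset("AB"), 0, 5))
print(is_free_from_rules(S, frozenset("B"), 0, 5))
```

Any item of Z can be used for the lookup, since every rule Z \ {B} ⇒ B carries the support of Z itself.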

One can generate association maps using the generic algorithm and the col- 
lection of all frequent association rules. Unfortunately, this input collection may 
be very large. Moreover, for some data sets (e.g. highly correlated census-like 
data sets), it is an intractable process to mine all frequent association rules at 
interesting support thresholds. 

In the next section, we show that one can avoid extracting all frequent rules
by using a more elaborate input collection.

3.4 Computing Association Maps Directly 

Let us now consider that S consists of all tuples (Z \ {H, C}, Z \ {H}, Z \ {C}, Z)
such that Z is a frequent itemset (w.r.t. threshold σ) containing both C and H,
and such that Z \ {H, C} is δ-free. We also consider that we have at hand the
supports of the itemsets in the tuples in S.

The main practical advantage of this new input S is that it is in general
much smaller than the set of all frequent association rules.

S can be used to generate the association map for consequent C, variation
criterion H, and thresholds δ, σ, τ, using algorithm GenMap when the functions
CandAnte, Signif and Effect are defined as follows.

CandAnte(S, i, C, H) selects from all tuples in S the ones having a first
element of size i and outputs these first elements. By grouping the tuples in S
according to the size of the first element at the time we construct S, we can
compute the result of CandAnte(S, i, C, H) in a very efficient manner.

Signif(S, X, C, H, τ, δ, σ) looks in S for the tuple (X, X ∪ {C}, X ∪ {H},
X ∪ {H, C}). If such a tuple exists in S then, by construction of S, X is δ-free
and X ∪ {H, C} is frequent. Then, to verify that the local effect is not neutral
for X, it compares the absolute value of the difference between
Sup(X ∪ {C})/Sup(X) and Sup(X ∪ {H, C})/Sup(X ∪ {H}) to the tolerance
threshold τ.

Effect(S, X, C, H) uses S to compute the local effect for X in the same
way as function Signif. In fact, in an implementation of algorithm GenMap
the value of Effect(S, X, C, H) is simply obtained during the computation of
Signif(S, X, C, H, τ, δ, σ).
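
With this input, Signif and Effect reduce to a lookup followed by a comparison of two ratios. The sketch below assumes S is stored as a dictionary from X to the four supports of its tuple, in the order Sup(X), Sup(X ∪ {C}), Sup(X ∪ {H}), Sup(X ∪ {H, C}); this storage layout, the function names and the numbers are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

# X -> (Sup(X), Sup(X + C), Sup(X + H), Sup(X + H + C)); an assumed layout for S.
Supports = Tuple[int, int, int, int]


def effect_from_supports(sup_x: int, sup_xc: int, sup_xh: int, sup_xhc: int,
                         tau: float) -> int:
    """Local effect of H computed from the four supports of one tuple of S."""
    variation = sup_xhc / sup_xh - sup_xc / sup_x
    if variation > tau:
        return 1
    if variation < -tau:
        return -1
    return 0


def signif(s: Dict[FrozenSet[str], Supports], x: FrozenSet[str], tau: float) -> bool:
    """True when x has a tuple in S (so x is free and x + {H, C} is frequent)
    and its local effect is not neutral."""
    entry = s.get(x)
    return entry is not None and effect_from_supports(*entry, tau) != 0


# Hypothetical content of S for two candidate antecedents.
S = {
    frozenset(): (6, 3, 3, 2),     # confidence rises from 1/2 to 2/3
    frozenset("A"): (6, 3, 3, 1),  # confidence falls from 1/2 to 1/3
}

print(effect_from_supports(*S[frozenset()], 0.1), signif(S, frozenset(), 0.1))
print(effect_from_supports(*S[frozenset("A")], 0.1), signif(S, frozenset("B"), 0.1))
```

Returning the effect from the same helper that Signif uses reflects the remark above that the value of Effect is obtained during the computation of Signif.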

The generation of S itself can be made using the algorithms presented in [7,6]
to mine δ-free sets. In our prototype we chose to generate S using the
technique proposed in [9,10] to mine frequent patterns efficiently even in the
presence of difficult dense data sets. The prototype first extracts a representation
called disjunction-bordered condensation using the algorithm VLinEx proposed in
[9,10] and then generates S from this representation. Finally, it produces the map
itself using GenMap.

4 Experiments 

To check the satisfaction of association map constraints we propose to first ex- 
tract the corresponding association maps from the data of the IDB, and then to 
compare these maps with the association map constraints given by the designer 
of the IDB. We consider that the association map constraints are rather small, 
thus we neglect the computing cost of the second step and take only the first one 
into account. In this section, we report experiments showing that the first step 
(computation of association maps over the IDB) can be done efficiently even in 
difficult cases. 

Conditions of Experiments. We chose Pumsb, a very challenging census
data set, containing 7117 items and 49046 rows, each with 74 items set to true.
The particularity of the selected data set is that it is very dense and the
combinatorial explosion of the number of frequent itemsets makes the mining of all
association rules intractable for low support thresholds [5]. This data set has
been preprocessed by researchers from IBM Almaden Research Center⁵.

We ran the experiments on a 1 GHz PC with 512 MB of RAM and the Linux
operating system.

To produce difficult conditions for the association map extraction, we chose
a value of δ equal to 0 (avoiding only regions due to exact dependencies), and
τ = 10“^ (a tolerance close to the minimal tolerance threshold defined in
Section 2.1). We also used a heuristic to select ten hard pairs consequent/variation
criterion, i.e., pairs such that regions in maps tend to be large or numerous.
We present this heuristic and the results of the experiments in the following
sections.

⁵ http://www.almaden.ibm.com/cs/quest/data/long_patterns.bin.tar

Selecting Consequents and Variation Criteria. We define a function to
associate a score to each pair of consequent C and variation criterion H.

This score is computed from a collection of itemsets denoted L_{σ,δ} and
containing all δ-free itemsets having a support exceeding σ. Let T_{C,H} be the
collection of itemsets in L_{σ,δ} containing items C and H. Let nb_neg (resp. nb_pos)
be the number of elements in T_{C,H} having a negative (resp. positive) local effect
for C, H with tolerance τ = 0. Let M_neg (resp. M_pos) be the mean value of the
confidence variation for all negative (resp. positive) local effects for C, H and τ = 0.

We define score(C, H) = T_CH * N_CH * P_CH * abs(M_neg) * M_pos.

The factor T_CH is |T_{C,H}| / |L_{σ,δ}| and represents the ratio of itemsets in
L_{σ,δ} that are candidates to be a base of a region.

The factor N_CH is nb_neg / |T_{C,H}| and corresponds to the ratio of negative local
effects among all possible local effects. P_CH is nb_pos / |T_{C,H}| and corresponds to
the same ratio for positive effects. N_CH * P_CH is maximal when N_CH = P_CH =
1/2, i.e., when the numbers of significant antecedents with negative and positive
local effects for a given C and H are the same and there are no neutral-effect
antecedents. High values of N_CH * P_CH indicate that the map is likely to contain
many changes of effects and thus many regions.

Finally, the factor abs(M_neg) * M_pos takes into account the amplitude of the
changes of the confidence. A higher value implies potential effect bases with clear
positive or negative effects, and thus a large number of bases even at high
values of the tolerance threshold.

A pair C, H having a high score(C, H) is likely to generate
maps containing large and numerous regions.
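
The score can be illustrated with a small Python sketch. It assumes that the confidence variations of the candidate itemsets for a pair C, H have already been computed, that both ratios N_CH and P_CH share the number of these candidates as denominator, and that the score is taken to be 0 when one of the two means is undefined; the function name, its parameters and the sample values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def score(n_free_frequent: int, variations: Sequence[float]) -> float:
    """Score of a pair (C, H).

    n_free_frequent: size of the whole collection of free frequent itemsets.
    variations: confidence variation of each candidate itemset for (C, H).
    """
    negative = [v for v in variations if v < 0]  # negative local effects, tau = 0
    positive = [v for v in variations if v > 0]  # positive local effects, tau = 0
    if not negative or not positive:
        # Assumption: N_CH or P_CH is 0 here, so the product is taken as 0.
        return 0.0
    t_ch = len(variations) / n_free_frequent
    n_ch = len(negative) / len(variations)
    p_ch = len(positive) / len(variations)
    return t_ch * n_ch * p_ch * abs(mean(negative)) * mean(positive)


# Hypothetical values: 4 candidates out of 8 itemsets, half negative, half positive.
print(score(8, [0.2, -0.1, 0.4, -0.3]))
print(score(8, [0.1, 0.2]))
```

A pair with balanced and pronounced positive and negative variations obtains a higher score than a pair whose variations all have the same sign.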



Results. Figure 3 summarizes the results. For various support thresholds, we
report the highest (MAX), the lowest (MIN) and the mean (MEAN) extraction
time over the ten pairs consequent/variation criterion having the highest
score(C, H) values.

Fig. 3. Extraction times of association maps (horizontal axis: support (%),
from 45 to 100)

The experiments show that on this difficult data set, for support thresholds
of 80% to 100%, the extraction of the maps from the data and thus the test
of satisfaction of the association map constraints can be done in practice
on-line (i.e., during interactive data manipulation sessions). For lower thresholds,
the integrity check can be performed reasonably off-line even at a 50% support
threshold, which represents very hard conditions on this dense data set⁶.

In practice, a large part of the map extraction time is spent computing
the association rules (or in our prototype, generating the intermediate
representation as presented in Section 3.4). It should be noticed that the computation
of the association rules (or of the intermediate representation) is common to all
maps for a given support threshold in a given data set. So, in most cases, when a
map has been extracted for a pair consequent/variation criterion, the maps for the
other pairs involved in association map constraints can be obtained at a marginal
extra cost.

⁶ Such conditions can be considered as much more difficult than lower support
thresholds (e.g., 1% or even less) on many sparse data sets (e.g., basket data, logs of
alarms).

5 Conclusion and Related Work 

To our knowledge the notion of integrity constraint in IDB has not previously
been explicitly investigated. In this paper we advocated that common data
mining tasks such as the detection of corrupted data or of patterns that
contradict the expert beliefs [19] can be integrated in a clean way under the concept
of integrity constraints for IDB.

We illustrated this possibility by proposing a form of integrity constraints
called association map constraints. Such a constraint is a specification of the
sign of the variation of association rule confidences when a given attribute is
added to the antecedent of the rule. These maps have a simple intuitive meaning
and can concisely constrain all association rules. Thus, they make it possible to
express clear and understandable specifications. Moreover, we have shown by means of
experiments that the satisfaction of association map constraints can be checked
in practice in a reasonably efficient way.

The use of confidence variation has been investigated previously in [4,14]
to prune and summarize collections of association rules. As for association maps,
these variations are considered w.r.t. a fixed rule consequent. [4] proposed to
select rules α ⇒ C showing an increase (or possibly a limited decrease) of
confidence with respect to all rules β ⇒ C where β ⊂ α (i.e., more general
rules). If we adapt this idea in the context of integrity constraints for IDB, it
leads to specifying that some particular rules must have a confidence higher than
any of their more general rules, and then to using the algorithm described in [4]
to check if this specification is satisfied in the database. In [14] the authors
proposed to select rules that are statistically more significant w.r.t. the more
general rules and then to summarize this collection of selected rules. Similarly
to [4], this approach can be adapted as integrity constraints for IDB.

Compared to these works, association maps are complementary. On one hand,
they are more specific in the sense that an association map focuses on the effect
of the absence/presence of a particular attribute H (the variation criterion) in
the antecedent of the rules. However, it is possible to specify several association
maps, each for a different attribute H. On the other hand, an association map
is a cartography of all association rules (areas of decrease/increase of confidence
w.r.t. the presence of H in the antecedent) and thus gives a more general view
than the approaches of [4] and [14], which concentrate on rules better (in some
sense) than the more general ones.

With respect to the association map constraints proposed in this paper, an
interesting issue to investigate is to determine how and in which cases the check
of the constraints can be performed incrementally with respect to the updates
of the databases.

A more general direction of future work is to investigate how the concepts
and techniques proposed previously in the data mining literature can be adapted
and used to specify and check the data and pattern consistency in the context
of IDB.